🤖 AI Summary
This work addresses decoding errors in composite DNA data storage that arise when sequencing distorts the base proportions of composite symbols. The main challenges are the tight coupling between biological constraints (run-length limits, GC-content requirements) and arbitrary base mixing ratios, high redundancy, and the absence of capacity bounds. We first establish an information-theoretic capacity bound for the constrained composite DNA channel. We then propose a capacity-approaching coding scheme that, for certain code parameters, requires only one redundant symbol and allows the constraints to be designed independently of the base mixing ratios. Leveraging finite-state machine encoding, graph-theoretic construction, and combinatorial optimization, we develop a low-redundancy, highly robust encoder–decoder. Simulation results demonstrate substantial improvement in composite-symbol decoding accuracy. Our framework provides both theoretical foundations and practical tools for efficient, biologically compliant DNA-based data storage.
📝 Abstract
Composite DNA is a recent method for increasing the information capacity of DNA-based data storage above the theoretical limit of 2 bits/symbol. In this method, each composite symbol stores not a single DNA nucleotide but a mixture of the four nucleotides in a predetermined ratio. By using different mixtures and ratios, the alphabet can be extended to far more than the four symbols of the naive approach. While this method enables higher data content per synthesis cycle, potentially reducing the cost of DNA synthesis, it also poses significant challenges for accurate DNA sequencing: base-level errors can easily change the mixture of bases and their ratio, and hence the composite symbols themselves. With this motivation, we propose efficient constrained coding techniques that enforce the biological constraints, including the runlength-limited constraint and the GC-content constraint, in every synthesized DNA oligo, regardless of the mixture of bases in each composite letter and their corresponding ratio. Our goals include computing the capacity of the constrained channel, constructing efficient encoders and decoders, and identifying the best choices of composite letters for obtaining capacity-approaching codes. For certain code parameters, our methods incur only one redundant symbol.
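To make the ideas in the abstract concrete, the following is a minimal illustrative sketch, not the paper's construction. It assumes a hypothetical "resolution" parameter k, where each composite letter splits k equal parts among A, C, G, and T, and it represents a composite letter only by its support (the bases with nonzero ratio). The helper names, the thresholds, and the brute-force check over every realizable oligo are all assumptions made for illustration; an efficient encoder would avoid such enumeration.

```python
from itertools import product
from math import comb, log2


def composite_alphabet_size(resolution: int) -> int:
    """Count the ways to split `resolution` equal parts among A, C, G, T.

    Assumption: ratios are multiples of 1/resolution (stars-and-bars count).
    """
    return comb(resolution + 3, 3)


def max_run_length(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest, run, prev = 0, 0, None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def every_realization_ok(supports, max_run, gc_low, gc_high) -> bool:
    """Brute-force check that every oligo realizable from a composite strand
    meets both constraints.

    `supports` lists, per position, the bases that occur with nonzero ratio,
    e.g. ["AC", "G", "T", "AC"].
    """
    for bases in product(*supports):
        oligo = "".join(bases)
        if max_run_length(oligo) > max_run:
            return False
        if not gc_low <= gc_content(oligo) <= gc_high:
            return False
    return True


if __name__ == "__main__":
    k = 6  # hypothetical resolution
    size = composite_alphabet_size(k)
    print(size, "composite letters, about", round(log2(size), 2), "bits/symbol")
    print(every_realization_ok(["AC", "G", "T", "AC"], 3, 0.25, 0.75))
```

With resolution 1 the count reduces to the four standard bases, that is, 2 bits/symbol; with resolution 6 it gives 84 letters, or about 6.39 bits/symbol before any constraint is imposed. The check over supports reflects the requirement that the constraints hold in every synthesized oligo regardless of the mixing ratios, since any base with nonzero ratio can appear at that position.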