🤖 AI Summary
This study addresses the coverage depth problem in DNA data storage: computing the expected number of reads required to recover all encoded strands. For a linear code, this equals the expected number of randomly sampled columns of a generator matrix needed to reach full rank. Using a duality argument and the extended weight enumerator, the authors develop a combinatorial framework that expresses the coverage depth of a linear code in terms of the weight distributions of its extensions to larger fields. With this framework, they derive closed-form expressions for the coverage depth of simplex codes, Hamming codes, the ternary Golay code, the extended ternary Golay code, and first-order Reed–Muller codes.
📝 Abstract
The coverage depth problem in DNA data storage is about computing the expected number of reads needed to recover all encoded strands. Given a generator matrix of a linear code, this quantity equals the expected number of randomly drawn columns required to obtain full rank. While MDS codes are optimal when they exist, i.e., over large fields, practical scenarios may rely on structured code families defined over small fields. In this work, we develop combinatorial tools to solve the DNA coverage depth problem for various linear codes, based on duality arguments and the notion of extended weight enumerator. Using these methods, we derive closed formulas for the simplex, Hamming, ternary Golay, extended ternary Golay, and first-order Reed-Muller codes. The centerpiece of this paper is a general expression for the coverage depth of a linear code in terms of the weight distributions of its higher-field extensions.
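To make the column-sampling formulation concrete, the following is a minimal brute-force sketch, not the paper's method. It assumes columns are drawn uniformly at random with replacement (the abstract does not state the sampling model), works only over GF(2), and encodes each column as an integer bitmask. The function names `gf2_rank` and `expected_draws_to_full_rank` are hypothetical. The sketch relies on a standard observation: the first s distinct columns seen form a uniformly random s-subset, and with s distinct columns already seen, the wait for a new one is geometric with mean n/(n-s).

```python
from fractions import Fraction
from itertools import combinations


def gf2_rank(vectors):
    """Rank over GF(2) of vectors encoded as integer bitmasks."""
    basis = []  # each basis element has a distinct leading bit
    for v in vectors:
        for b in basis:
            # XOR with b only when that clears b's leading bit in v.
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    return len(basis)


def expected_draws_to_full_rank(columns, k):
    """Exact expected number of uniform draws (with replacement, an
    assumption of this sketch) until the drawn columns have rank k.

    Brute force over all column subsets, so only usable for short codes.
    """
    n = len(columns)
    if gf2_rank(columns) < k:
        raise ValueError("the columns do not span a k-dimensional space")
    total = Fraction(0)
    for s in range(n):
        subsets = 0
        deficient = 0
        for subset in combinations(columns, s):
            subsets += 1
            if gf2_rank(subset) < k:
                deficient += 1
        # P(a random s-subset is rank deficient) times the mean wait
        # for the next previously unseen column.
        total += Fraction(deficient, subsets) * Fraction(n, n - s)
    return total


# Binary [7, 3] simplex code: the columns of a generator matrix are all
# nonzero vectors of GF(2)^3, encoded here as the integers 1..7.
simplex_columns = list(range(1, 8))
print(expected_draws_to_full_rank(simplex_columns, 3))  # 47/12
```

For an MDS code, every set of k columns has full rank, so the sum reduces to n/n + n/(n-1) + ... + n/(n-k+1). The binary [3, 2] single-parity-check code with columns `[1, 2, 3]` gives 5/2 under this sketch. The subset enumeration grows exponentially in n, which is why closed formulas such as those derived in the paper matter for longer codes.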