2024 | Inbal Preuss, Student Member, IEEE, Ben Galili, Zohar Yakhini, Member, IEEE, and Leon Anavy
This study introduces a novel model for analyzing the required sequencing coverage in DNA-based data storage, focusing on combinatorial DNA encoding. The authors use a variant of the coupon collector distribution and a Markov Chain representation to characterize the distribution of the number of sequencing reads needed for message reconstruction. They develop theoretical bounds on the decoding probability and validate these bounds through empirical simulations. The work contributes to understanding sequencing coverage in DNA-based data storage, offering insights into decoding complexity, error correction, and sequence reconstruction. A Python package is provided to compute the required read coverage, ensuring message reconstruction with a specified confidence level. The study also explores the impact of various design parameters on the required coverage and compares the results with Monte Carlo simulations. The findings highlight the importance of carefully selecting system parameters to optimize efficiency and reliability.
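
To make the Markov chain view of read coverage concrete, here is a minimal sketch. It covers only the classical coupon collector problem, in which each read is assumed to hit one of n items uniformly at random, and not the combinatorial variant the authors analyze. The function names `collect_all_cdf` and `required_reads` are hypothetical and are not taken from the authors' Python package.

```python
def collect_all_cdf(n, max_reads):
    """Return [P(all n items seen within t reads) for t = 0..max_reads].

    Assumption: every read hits one of the n items uniformly at random.
    The chain's state k is the number of distinct items seen so far.
    """
    state = [0.0] * (n + 1)
    state[0] = 1.0  # before any read, no item has been seen
    cdf = [state[n]]
    for _ in range(max_reads):
        nxt = [0.0] * (n + 1)
        for k, p in enumerate(state):
            if p == 0.0:
                continue
            nxt[k] += p * k / n  # read hits an already-seen item
            if k < n:
                nxt[k + 1] += p * (n - k) / n  # read hits a new item
        state = nxt
        cdf.append(state[n])  # state n is absorbing: everything seen
    return cdf


def required_reads(n, confidence, max_reads=1000):
    """Smallest t such that P(all n items seen within t reads) >= confidence."""
    for t, p in enumerate(collect_all_cdf(n, max_reads)):
        if p >= confidence:
            return t
    raise ValueError("confidence not reached within max_reads")


if __name__ == "__main__":
    # Reads needed to see all 20 equally likely items with 99% probability.
    print(required_reads(20, 0.99))
```

Propagating the state vector one read at a time yields the full distribution of the number of reads needed, so a coverage target for any confidence level can be read directly off the cumulative probabilities.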