Understanding Sequencing Coverage Analysis for Combinatorial DNA-Based Storage Systems

This paper introduces a novel model for analyzing and determining the sequencing coverage required in DNA-based data storage systems that use combinatorial DNA encoding. The study focuses on the distribution of the number of sequencing reads needed for message reconstruction, which it models with a variant of the coupon collector distribution. A Markov chain representation is used to compute the probability of error-free reconstruction, and theoretical bounds on the decoding probability are developed. Empirical simulations validate these bounds and assess their tightness. The work contributes to the understanding of sequencing coverage in DNA-based data storage, offering insights into decoding complexity, error correction, and sequence reconstruction.

The analysis is broken down into three steps: decoding a single combinatorial letter, decoding a combinatorial sequence, and decoding a complete combinatorial message. The study demonstrates that a higher desired confidence level requires higher sequencing coverage, while a higher redundancy level reduces the number of reads needed. The results show that the proposed model is more efficient and scalable than Monte Carlo simulations. A Python package is provided that calculates, from the design parameters and a desired confidence level, the read coverage required to guarantee message reconstruction. The paper also discusses the importance of carefully selecting system parameters to optimize the efficiency and reliability of DNA-based data storage systems.
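The single-letter step of the analysis can be illustrated with a textbook coupon collector chain. The sketch below is not the paper's Python package and does not reproduce its model. It assumes, purely for illustration, that a combinatorial letter has `k` members, that every read shows one member chosen uniformly at random, that reads are error-free, and that seeing each member once is enough to decode the letter. The function names (`step`, `letter_decoding_probability`, `closed_form`, `required_reads`) are hypothetical.

```python
from math import comb


def step(dist):
    """Advance the chain by one read; dist[i] = P(exactly i distinct members seen)."""
    k = len(dist) - 1
    nxt = [0.0] * (k + 1)
    for i, p in enumerate(dist):
        nxt[i] += p * i / k  # the read repeats an already-seen member
        if i < k:
            nxt[i + 1] += p * (k - i) / k  # the read reveals a new member
    return nxt


def letter_decoding_probability(k, reads):
    """P(all k members were seen at least once after `reads` reads).

    Assumption (illustrative only): uniform, error-free sampling of members.
    """
    dist = [1.0] + [0.0] * k  # start with nothing observed
    for _ in range(reads):
        dist = step(dist)
    return dist[k]


def closed_form(k, reads):
    """Same probability via inclusion-exclusion, used as a cross-check."""
    return sum((-1) ** j * comb(k, j) * ((k - j) / k) ** reads for j in range(k + 1))


def required_reads(k, confidence, max_reads=100_000):
    """Smallest number of reads whose decoding probability reaches `confidence`."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    dist = [1.0] + [0.0] * k
    for reads in range(1, max_reads + 1):
        dist = step(dist)
        if dist[k] >= confidence:
            return reads
    raise ValueError("confidence not reached within max_reads")
```

Calling `required_reads(5, 0.9)` and then `required_reads(5, 0.99)` shows the qualitative behavior described in the summary: under these assumptions, a higher confidence level never needs fewer reads. Extending the sketch to sequences and full messages would require further assumptions about how letters share reads and about redundancy, which the summary does not specify.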

Sequencing Coverage Analysis for Combinatorial DNA-Based Storage Systems

June 2024 | Inbal Preuss, Ben Galili, Zohar Yakhini, and Leon Anavy