2010 June | Jason R. Miller, Sergey Koren, and Granger Sutton
The article reviews assembly algorithms for next-generation sequencing (NGS) data, focusing on de novo whole-genome shotgun assembly. It compares the two standard assembly methods: the de Bruijn graph (DBG) approach and the overlap/layout/consensus (OLC) approach. NGS data from platforms such as 454, Illumina, and SOLiD have shorter read lengths, higher coverage, and different error profiles than Sanger sequencing data. Several assembly software packages have been developed or revised for NGS data, including SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. The DBG approach represents sequences with K-mer graphs, while the OLC approach relies on overlap graphs. The DBG approach is more sensitive to repeats and sequencing errors, while the OLC approach is more sensitive to the settings for minimum overlap length and minimum identity. The article also discusses challenges in assembly, including repeat resolution, sequencing error, and non-uniform coverage, and it highlights the importance of heuristics and approximation algorithms in overcoming them. It concludes by summarizing the key features and performance of the assembly algorithms for NGS data.
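
To make the contrast between the two graph types concrete, the following is a minimal Python sketch, not taken from the article or from any of the packages named above. The function names `kmer_graph` and `overlap_graph`, the parameter `min_overlap`, and the toy reads are hypothetical. It assumes error-free reads on a single strand and exact matching only, so it leaves out reverse complements, the minimum-identity threshold, and the error handling that real assemblers need.

```python
from collections import defaultdict


def kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read.

    Consecutive k-mers in a read overlap by k-1 bases, so reads sharing
    a k-mer are merged implicitly at that node.
    """
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return dict(graph)


def overlap_graph(reads, min_overlap):
    """Map each read to {other_read: longest exact suffix/prefix overlap}.

    Only proper overlaps of at least min_overlap bases are kept
    (min_overlap is assumed to be >= 1). This is a naive all-pairs scan.
    """
    graph = defaultdict(dict)
    for a in reads:
        for b in reads:
            if a == b:
                continue
            longest = min(len(a), len(b)) - 1
            for length in range(longest, min_overlap - 1, -1):
                if a[-length:] == b[:length]:
                    graph[a][b] = length
                    break
    return dict(graph)


reads = ["ACGTAC", "GTACGG"]
print(kmer_graph(reads, 3))
print(overlap_graph(reads, 3))
```

In this toy input the overlap graph has a single edge between the two reads, while the K-mer graph contains a branch at `ACG` (followed by both `CGT` and `CGG`), because that K-mer occurs in two different contexts. This is a small illustration of why K-mer graphs are sensitive to repeats that are longer than K but shorter than a read.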