A survey of sequence alignment algorithms for next-generation sequencing

Heng Li and Nils Homer

Abstract: Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge in the analysis of these data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software has been developed over the past two years. This article systematically reviews the current development of these algorithms and introduces their practical applications to different types of experimental data. It concludes that short-read alignment is no longer the bottleneck of data analyses. It also considers the future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.

Keywords: new sequencing technologies; alignment algorithm; sequence analysis

Introduction: The rapid development of new sequencing technologies substantially extends the scale and resolution of many biological applications, including the scan of genome-wide variation, identification of protein binding sites (ChIP-seq), quantitative analysis of the transcriptome (RNA-seq), the study of genome-wide methylation patterns, and the assembly of new genomes or transcriptomes. Most of these applications take alignment or de novo assembly as the first step; even in de novo assembly, sequence reads may still need to be aligned back to the assembly, as most large-scale short-read assemblers do not track the location of each individual read. Sequence alignment is therefore essential to nearly all applications of new sequencing technologies.

All new sequencing technologies in production, including Roche/454, Illumina, SOLiD and Helicos, are able to produce data of the order of giga base-pairs (Gbp) per machine day. With the emergence of such data, researchers quickly realized that even the best tools for aligning capillary reads are not efficient enough for this unprecedented amount of data. To keep pace with the throughput of sequencing technologies, many new alignment tools have been developed in the last two years. These tools exploit the advantages specific to each new sequencing technology, such as the short sequence length of Illumina, SOLiD and Helicos reads, the di-base encoding of SOLiD reads, the high base quality towards the 5'-end of Illumina and 454 reads, the low indel error rate of Illumina reads and the low substitution error rate of Helicos reads. Short-read aligners outperform traditional aligners in terms of both speed and accuracy. They greatly boost the applications of new sequencing technologies as well as the theoretical study of alignment algorithms.

This article aims to systematically review the recent advances in alignment algorithms. It is organized as follows. We first review the progress on general alignment techniques and then examine their applications in the context of specific sequencing platforms and experimental designs. We will use simulated data to evaluate the necessity of gapped alignment and paired-end mapping, and present a list of alignment software packages that are actively maintained and widely used. Finally, we will discuss the future development of alignment algorithms.

11 May 2010 | Heng Li and Nils Homer