Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

2013 | Heng Li
BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths, from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-the-art read aligners. BWA-MEM is implemented as a component of BWA, available at https://github.com/lh3/bwa.
Most short-read mappers were developed for 36bp reads, but improved NGS technologies now produce longer reads, which pose new challenges. For reads of 100bp or longer, it is important to allow long gaps and to report multiple non-overlapping local hits, and many short-read alignment algorithms are not suitable for this. Existing long-read alignment algorithms have their own limitations, such as slower speed or a lack of features needed for large-scale NGS data. This motivated the development of BWA-MEM.
BWA-MEM uses a seed-and-extend approach. It initially seeds alignments with supermaximal exact matches (SMEMs) and re-seeds if necessary. Seeds are grouped into chains, and the chains are filtered to reduce unsuccessful extensions. Seed extension uses banded affine-gap-penalty dynamic programming. BWA-MEM stops an extension once the current extension score falls too far below the best score seen so far. It also tracks the best extension score that reaches the end of the query, which lets it choose between local and end-to-end alignments. For paired-end reads, BWA-MEM rescues missing hits using SSE2-based Smith-Waterman alignment. It pairs hits based on alignment scores, insert size, and the possibility of chimeric reads.
BWA-MEM performs well on simulated data, showing better accuracy than NovoAlign and comparable performance to GEM and Cushaw2. It is faster than Bowtie2 and Cushaw2 for long reads. BWA-MEM is also able to identify chimeric reads, a crucial feature for contig alignment.
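The seed extension step summarized above can be illustrated with a small sketch. It is not BWA's code: it is a plain-Python, unvectorized affine-gap extension anchored at the end of a seed, with a band limit, a simplified drop-off stopping rule, and a rule for picking local versus end-to-end output. The function names, the scoring values (match 1, mismatch 4, gap open 6, gap extend 1), the band width, the drop-off threshold and the clipping penalty are all assumptions chosen for illustration, and the drop-off test here compares only the row maximum with the best score so far.

```python
NEG = float("-inf")  # marks cells that are outside the band or unreachable


def extend_seed(query, ref, h0, match=1, mismatch=4, gap_open=6,
                gap_ext=1, band=100, zdrop=100):
    """Extend an alignment rightwards from a seed that already scores h0.

    Returns (best, ref_end, query_end, end_score): the best extension score,
    where it ends, and the best score that consumes the whole query.
    A gap of length k costs gap_open + k * gap_ext (illustrative values).
    """
    m = len(query)
    # Row 0: no reference base consumed yet, only leading query gaps.
    h_prev = [NEG] * (m + 1)
    e_prev = [NEG] * (m + 1)
    h_prev[0] = h0
    for j in range(1, min(m, band) + 1):
        h_prev[j] = h0 - gap_open - gap_ext * j
    best, best_i, best_j = h0, 0, 0
    end_score = h_prev[m]

    for i in range(1, len(ref) + 1):
        lo, hi = max(0, i - band), min(m, i + band)  # band: |i - j| <= band
        h = [NEG] * (m + 1)
        e = [NEG] * (m + 1)
        f = NEG
        row_best, row_best_j = NEG, -1
        for j in range(lo, hi + 1):
            # e: alignment ends with a reference base opposite a gap
            e[j] = max(h_prev[j] - gap_open - gap_ext, e_prev[j] - gap_ext)
            # f: alignment ends with a query base opposite a gap
            f = max(h[j - 1] - gap_open - gap_ext, f - gap_ext) if j > lo else NEG
            diag = NEG
            if j > 0:
                s = match if ref[i - 1] == query[j - 1] else -mismatch
                diag = h_prev[j - 1] + s
            h[j] = max(diag, e[j], f)
            if h[j] > row_best:
                row_best, row_best_j = h[j], j
        if row_best > best:
            best, best_i, best_j = row_best, i, row_best_j
        elif best - row_best > zdrop:
            break  # simplified drop-off: this row is too far below the best
        end_score = max(end_score, h[m])
        h_prev, e_prev = h, e
    return best, best_i, best_j, end_score


def choose_alignment(best, end_score, clip_penalty=5):
    """Prefer end-to-end unless clipping gains more than clip_penalty (assumed)."""
    if end_score > 0 and end_score > best - clip_penalty:
        return "end-to-end"
    return "local"
```

Under these assumed parameters, a query that matches the reference for ten bases and mismatches only at its final base scores 10 locally and 6 end-to-end; since 6 exceeds 10 minus the clipping penalty of 5, the sketch reports the end-to-end alignment rather than clipping the last base. A query with a long mismatching tail instead falls back to the local alignment.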
BWA-MEM is fast and accurate for sequence reads and works well for both short and long sequences. It is suitable for large genomes and can be made faster with SSE2-based banded DP and restricted DP. Seeding is the bottleneck for short sequences, while banded DP is the bottleneck for long sequences.
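To make the seeding step concrete, here is a brute-force sketch of what a supermaximal exact match is: an exact match between query and reference that cannot be extended in either direction and is not contained in any longer match on the query. BWA finds these with an index over the reference; the sketch below swaps in plain substring search, so it only illustrates the definition and is far too slow for a real genome. The function name `find_smems` and the `min_len` parameter are hypothetical.

```python
def find_smems(query, ref, min_len=1):
    """Return SMEMs as half-open query intervals (start, end).

    For each query start, the match is grown rightwards as far as it still
    occurs somewhere in ref. An interval is kept only if it ends further
    right than every interval that started earlier, which means no longer
    match contains it.
    """
    seeds = []
    prev_end = 0  # furthest match end seen for any earlier start
    for start in range(len(query)):
        # query[start:prev_end] is known to occur in ref when prev_end > start,
        # because it is a suffix of an earlier match.
        end = max(prev_end, start)
        while end < len(query) and query[start:end + 1] in ref:
            end += 1
        if end > prev_end and end - start >= min_len:
            seeds.append((start, end))
        prev_end = max(prev_end, end)
    return seeds
```

Raising `min_len` drops short seeds, which trades sensitivity for fewer candidate locations to extend.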