This paper presents an efficient architecture-aware implementation of BWA-MEM for multicore systems that maintains identical output, enabling seamless replacement of the original tool. BWA-MEM is a widely used tool for sequence mapping, accounting for over 30% of the overall time in the GATK best practices workflow. The main challenge is to accelerate BWA-MEM while maintaining identical output, as many genomics studies require consistent results over long periods.
The BWA-MEM algorithm consists of three key kernels: SMEM (super maximal exact matches), SAL (suffix array lookup), and BSW (banded Smith-Waterman). These kernels collectively account for over 85% of the total compute time. The authors achieved significant speedups by optimizing these kernels through techniques such as improving cache reuse, simplifying algorithms, replacing small memory allocations with large contiguous ones, software prefetching, and utilizing SIMD instructions. The results show speedups of 2×, 183×, and 8× on the three kernels, leading to overall speedups of 3.5× and 2.4× on end-to-end compute time on a single thread and single socket of an Intel Xeon Skylake processor.
The optimizations were implemented using a reorganized workflow that enables SIMD parallelism across different reads in a batch. The authors also improved inefficient memory allocation by allocating all required memory once in large contiguous blocks, which improves hardware prefetching and cache reuse. For the SMEM kernel, they optimized the FM-index structure and applied software prefetching to reduce memory latency. For the SAL kernel, they simplified the algorithm and used the uncompressed suffix array to reduce the number of instructions. For the BSW kernel, they vectorized the implementation using SIMD parallelism and applied inter-task vectorization to improve parallelism.
The results show that the optimized BWA-MEM achieves the highest reported speedup over BWA-MEM while using a single CPU or a single CPU-single GPGPU/FPGA combination. The implementation is open-sourced, allowing users of BWA-MEM to benefit from increased performance. The authors conclude that their approach provides a significant performance improvement while maintaining identical output, making it a valuable tool for genomics research.This paper presents an efficient architecture-aware implementation of BWA-MEM for multicore systems that maintains identical output, enabling seamless replacement of the original tool. BWA-MEM is a widely used tool for sequence mapping, accounting for over 30% of the overall time in the GATK best practices workflow. The main challenge is to accelerate BWA-MEM while maintaining identical output, as many genomics studies require consistent results over long periods.
The BWA-MEM algorithm consists of three key kernels: SMEM (super maximal exact matches), SAL (suffix array lookup), and BSW (banded Smith-Waterman). These kernels collectively account for over 85% of the total compute time. The authors achieved significant speedups by optimizing these kernels through techniques such as improving cache reuse, simplifying algorithms, replacing small memory allocations with large contiguous ones, software prefetching, and utilizing SIMD instructions. The results show speedups of 2×, 183×, and 8× on the three kernels, leading to overall speedups of 3.5× and 2.4× on end-to-end compute time on a single thread and single socket of an Intel Xeon Skylake processor.
The optimizations were implemented using a reorganized workflow that enables SIMD parallelism across different reads in a batch. The authors also improved inefficient memory allocation by allocating all required memory once in large contiguous blocks, which improves hardware prefetching and cache reuse. For the SMEM kernel, they optimized the FM-index structure and applied software prefetching to reduce memory latency. For the SAL kernel, they simplified the algorithm and used the uncompressed suffix array to reduce the number of instructions. For the BSW kernel, they vectorized the implementation using SIMD parallelism and applied inter-task vectorization to improve parallelism.
The results show that the optimized BWA-MEM achieves the highest reported speedup over BWA-MEM while using a single CPU or a single CPU-single GPGPU/FPGA combination. The implementation is open-sourced, allowing users of BWA-MEM to benefit from increased performance. The authors conclude that their approach provides a significant performance improvement while maintaining identical output, making it a valuable tool for genomics research.