This paper presents an efficient, architecture-aware implementation of BWA-MEM, a widely used tool for sequence mapping, that accelerates it on multicore systems. The authors focus on the three key kernels (SMEM, SAL, and BSW), which together account for over 85% of the overall compute time. They achieve significant speedups by improving cache reuse, simplifying algorithms, optimizing memory allocation, using software prefetching, and leveraging SIMD instructions. The optimizations yield speedups of 2×, 183×, and 8× for the three kernels, respectively, which translate into end-to-end compute-time speedups of up to 3.5× on a single thread and up to 2.4× on a single socket of an Intel Xeon Skylake processor. The optimized implementation produces output identical to that of the original BWA-MEM, making it a drop-in replacement. The paper also includes a detailed performance analysis and comparisons with related work that show the effectiveness of the proposed optimizations.
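The gap between the per-kernel speedups and the end-to-end figures follows from Amdahl's law: code outside the three kernels still runs at its original speed. The sketch below is illustrative only and is not taken from the paper. It treats the "over 85%" kernel share as exactly 0.85 (an assumption), and the function name `end_to_end_speedup` is hypothetical. It computes the ceiling on end-to-end speedup if the kernels became infinitely fast.

```python
def end_to_end_speedup(fractions, speedups):
    """Amdahl's law for several accelerated parts.

    fractions: share of the original runtime spent in each accelerated part
    speedups:  speedup factor achieved on each of those parts
    Runtime not covered by `fractions` is assumed to stay at its original speed.
    """
    if len(fractions) != len(speedups):
        raise ValueError("fractions and speedups must have the same length")
    untouched = 1.0 - sum(fractions)
    new_time = untouched + sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time


# Assumption: the summary says "over 85%"; 0.85 is used here as a round lower bound.
KERNEL_SHARE = 0.85

# Ceiling on end-to-end speedup if the kernels took no time at all.
ceiling = end_to_end_speedup([KERNEL_SHARE], [float("inf")])
print(f"ceiling with {KERNEL_SHARE:.0%} of time in kernels: {ceiling:.2f}x")
```

Under this assumption the ceiling is roughly 6.7×, so the reported end-to-end speedup of up to 3.5× is consistent with much larger speedups on individual kernels. The actual per-kernel time shares are not given in this summary, so no finer breakdown is attempted here.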