A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

January 7, 2011 | Guillaume Marçais1,* and Carl Kingsford2
The paper introduces Jellyfish, a fast and memory-efficient k-mer counting algorithm designed for parallel computing. K-mers are substrings of length \( k \) that are crucial in various applications such as genome assembly, error correction, and repeat detection. The increasing amount of sequence data generated by next-generation sequencing technologies has made current k-mer counting tools slow and memory-intensive. Jellyfish addresses this issue by using a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. The algorithm leverages the 'compare-and-swap' (CAS) instruction, a widely available hardware operation, to implement efficient shared access to the data structures. This design allows Jellyfish to achieve significantly faster performance compared to traditional k-mer counting methods, with a speedup of an order of magnitude or more. Additionally, Jellyfish is highly memory-efficient, using a constant amount of memory per key in the hash table, regardless of the length of the k-mers. The software is written in C++ and is available under the GPL license. Experimental results demonstrate that Jellyfish can process large genomic datasets, including those from high-coverage sequencing projects, in a matter of minutes, making it a valuable tool for biological data analysis.The paper introduces Jellyfish, a fast and memory-efficient k-mer counting algorithm designed for parallel computing. K-mers are substrings of length \( k \) that are crucial in various applications such as genome assembly, error correction, and repeat detection. The increasing amount of sequence data generated by next-generation sequencing technologies has made current k-mer counting tools slow and memory-intensive. Jellyfish addresses this issue by using a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. The algorithm leverages the 'compare-and-swap' (CAS) instruction, a widely available hardware operation, to implement efficient shared access to the data structures. This design allows Jellyfish to achieve significantly faster performance compared to traditional k-mer counting methods, with a speedup of an order of magnitude or more. Additionally, Jellyfish is highly memory-efficient, using a constant amount of memory per key in the hash table, regardless of the length of the k-mers. The software is written in C++ and is available under the GPL license. Experimental results demonstrate that Jellyfish can process large genomic datasets, including those from high-coverage sequencing projects, in a matter of minutes, making it a valuable tool for biological data analysis.
Reach us at info@study.space