CD-HIT: accelerated for clustering the next-generation sequencing data

October 11, 2012 | Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li
CD-HIT is a widely used program for clustering biological sequences to reduce redundancy and improve the performance of other sequence analyses. In response to the rapid increase in next-generation sequencing data, a new CD-HIT program has been developed with a novel parallelization strategy and other techniques to cluster large datasets efficiently. The enhanced CD-HIT can handle very large datasets in much less time than previous versions. The program is available at http://cd-hit.org; the contact address is liwz@sdsc.edu.

CD-HIT is a greedy incremental algorithm. It starts with the longest input sequence as the first cluster representative, then processes the remaining sequences from long to short, classifying each one as redundant or as a new representative based on its similarity to the existing representatives. Similarities are estimated by counting common words, using word indexing and counting tables, so that unnecessary sequence alignments are filtered out. To accelerate CD-HIT, the core steps have been simplified into two key procedures: a checking procedure and a clustering procedure. The algorithm requires at most two word tables and never needs to swap them to disk. The parallelization technique uses both tables: with T threads, T-1 threads run checking procedures against one word table (an immutable checking table), while the remaining thread runs a single clustering procedure using the other table (a mutable clustering table) in parallel. The new CD-HIT includes other enhancements such as faster file reading, better filtering threshold estimation, more efficient word counting, and better alignment band estimation. The program is implemented in C++ and uses OpenMP for parallelization.

The results show that the new CD-HIT is significantly more efficient than the old version and is comparable to or more efficient than UCLUST. When multiple cores are used, the new CD-HIT is much more efficient than either the old version or UCLUST.
The program can cluster large datasets in hours on multi-core machines, where previous versions required days. The enhanced CD-HIT is expected to find more applications in handling next-generation sequencing data.
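
The greedy incremental procedure and the word-count filter described above can be illustrated with a small single-threaded sketch. This is not the CD-HIT implementation: the function names are hypothetical, the ungapped sliding comparison is a toy stand-in for the banded alignment the real program performs, and the filter bound (each mismatch can destroy at most k words) is the usual short-word argument, assumed here rather than taken from this summary. The parallel checking and clustering tables are not modeled.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_matches(length, threshold):
    """Fewest identical positions needed for a sequence of this length."""
    # The small epsilon guards against floating-point rounding in the product.
    return math.ceil(threshold * length - 1e-9)


def min_shared_words(length, k, threshold):
    """Lower bound on shared k-mers implied by the identity threshold.

    Assumption: each allowed mismatch destroys at most k of the
    length - k + 1 words of the shorter sequence.
    """
    mismatches = length - min_matches(length, threshold)
    return max(0, length - k + 1 - k * mismatches)


def best_ungapped_matches(short, long_):
    """Toy stand-in for alignment: slide short along long_ without gaps."""
    return max(
        sum(a == b for a, b in zip(short, long_[off:]))
        for off in range(len(long_) - len(short) + 1)
    )


def greedy_cluster(seqs, threshold=0.9, k=3):
    """Greedy incremental clustering, longest sequence first."""
    reps = []       # (representative, its word counts), in creation order
    clusters = {}   # representative -> member sequences
    for seq in sorted(seqs, key=len, reverse=True):
        words = kmer_counts(seq, k)
        need_words = min_shared_words(len(seq), k, threshold)
        need_matches = min_matches(len(seq), threshold)
        for rep, rep_words in reps:
            # Cheap filter: too few common words means the pair cannot
            # reach the threshold, so the comparison is skipped.
            if sum((words & rep_words).values()) < need_words:
                continue
            if best_ungapped_matches(seq, rep) >= need_matches:
                clusters[rep].append(seq)   # redundant: joins this cluster
                break
        else:
            reps.append((seq, words))       # no match: new representative
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    demo = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACG"]
    for rep, members in greedy_cluster(demo).items():
        print(rep, members)
```

In this sketch a sequence joins the first representative that passes both the word filter and the comparison; the filter only skips pairs that could not reach the threshold under the ungapped measure, so it changes the running time but not the resulting clusters.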