Advance Access publication October 11, 2012 | Limin Fu, Beifang Niu, Zhengwei Zhu†, Sitao Wu and Weizhong Li*
CD-HIT is a widely used program for clustering biological sequences to reduce redundancy and improve the performance of downstream sequence analyses. To keep pace with the rapid growth of next-generation sequencing data, the authors developed an enhanced version of CD-HIT that incorporates a novel parallelization strategy and other techniques for handling large datasets efficiently. In the authors' tests, the new CD-HIT shows speedup from parallelization on up to ~24 cores, and the speedup is quasi-linear on up to ~8 cores. As a result, the enhanced version can process very large datasets in much less time than previous versions.
The parallelization technique involves using two word tables and multiple threads to run checking and clustering procedures in parallel, ensuring proper grouping and scheduling of input sequences. Additional enhancements include faster file reading, better filtering threshold estimation, more efficient word counting, and improved alignment band estimation. The new CD-HIT is implemented in C++ using OpenMP for parallelization and has been tested on various protein and DNA sequence datasets, showing superior performance compared to the previous version and a similar program, UCLUST. The enhanced CD-HIT is expected to find broader applications in handling next-generation sequencing data.
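The split between a parallel checking step and a serial clustering step can be illustrated with a short C++ sketch. This is not the CD-HIT source code: the names `words_of`, `shared_fraction`, and `greedy_cluster`, the `block` parameter, and the use of a shared-word fraction in place of a real alignment are all assumptions made for illustration. The sketch also runs the two steps one after another for each block of input, whereas the summary above describes two word tables that let checking and clustering proceed at the same time.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <unordered_set>
#include <vector>

using WordSet = std::unordered_set<std::string>;

// Hypothetical helper: the distinct words of length k in a sequence.
static WordSet words_of(const std::string& seq, std::size_t k) {
    WordSet out;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) out.insert(seq.substr(i, k));
    return out;
}

// Stand-in similarity (assumption): fraction of the query's words that
// also occur in the representative. A real tool would align the sequences.
static double shared_fraction(const WordSet& query, const WordSet& rep) {
    if (query.empty()) return 0.0;
    std::size_t hits = 0;
    for (const auto& w : query) hits += rep.count(w);
    return static_cast<double>(hits) / static_cast<double>(query.size());
}

// Returns, for each input sequence, the index of its cluster representative.
std::vector<int> greedy_cluster(const std::vector<std::string>& seqs,
                                std::size_t k, double cutoff, long block) {
    const long n = static_cast<long>(seqs.size());
    if (block < 1) block = 1;

    std::vector<WordSet> words(seqs.size());
    for (long i = 0; i < n; ++i) words[i] = words_of(seqs[i], k);

    // Process longer sequences first, so each representative is the
    // longest member of its cluster.
    std::vector<int> order(seqs.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return seqs[a].size() > seqs[b].size();
    });

    std::vector<int> rep_of(seqs.size(), -1);
    std::vector<int> reps;  // representatives, in the order they were found

    for (long start = 0; start < n; start += block) {
        const long end = std::min(n, start + block);
        const long frozen = static_cast<long>(reps.size());

        // Checking: compare every sequence in the block with the
        // representatives found in earlier blocks. That list is read-only
        // here and each iteration writes a different rep_of entry, so the
        // iterations are independent.
        #pragma omp parallel for schedule(dynamic)
        for (long p = start; p < end; ++p) {
            const int i = order[p];
            for (long r = 0; r < frozen; ++r) {
                if (shared_fraction(words[i], words[reps[r]]) >= cutoff) {
                    rep_of[i] = reps[r];
                    break;
                }
            }
        }

        // Clustering: still-unassigned sequences are compared, serially,
        // with representatives created inside this block, or become
        // new representatives themselves.
        for (long p = start; p < end; ++p) {
            const int i = order[p];
            if (rep_of[i] != -1) continue;
            for (std::size_t r = static_cast<std::size_t>(frozen); r < reps.size(); ++r) {
                if (shared_fraction(words[i], words[reps[r]]) >= cutoff) {
                    rep_of[i] = reps[r];
                    break;
                }
            }
            if (rep_of[i] == -1) {
                rep_of[i] = i;
                reps.push_back(i);
            }
        }
    }
    return rep_of;
}
```

Built with an OpenMP flag such as `-fopenmp`, the checking loop is shared among threads; without the flag the pragma is ignored and the function runs serially. Because each sequence still meets the representatives in the order they were created, the assignment does not depend on the block size or the thread count in this sketch.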