UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

The article analyzes the UniRef databases, which offer a comprehensive and scalable alternative to native sequence databases for sequence similarity searches. UniRef provides clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniParc records, giving complete coverage of sequence space at several resolutions while reducing redundancy. UniRef100 combines identical sequences and subfragments from any source organism into a single cluster. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at 90% and 50% sequence identity, respectively. Each database contains summary cluster and membership information, including the sequence of a representative protein, member count, common taxonomy, and links to functional annotations. An 80% sequence length overlap threshold, introduced into the computation of UniRef90 and UniRef50, requires every cluster member to have a length overlap of at least 80% with the longest sequence in the cluster. This threshold improves intra-cluster molecular function consistency and keeps proteins that share only partial sequences from being clustered together. The UniRef databases have been in use for over a decade and are widely applied in functional annotation, family classification, systems biology, structural genomics, phylogenetic analysis, and mass spectrometry. The authors compared UniRef50-based sequence similarity searches with searches against native sequence databases. UniRef50-based searches were approximately 6 times faster, returned hit lists 7 times shorter, and were more sensitive in detecting remote similarities than UniProtKB-based searches. Precision and recall were also higher, with over 96% recall at an E-value of <0.0001.
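To make the precision and recall figures concrete, here is a minimal sketch of how such metrics could be computed for a single query at an E-value cutoff. The `precision_recall` helper, the `(accession, evalue)` hit tuples, and the set of "relevant" accessions are illustrative assumptions, not the authors' evaluation pipeline.

```python
def precision_recall(hits, relevant, evalue_cutoff=1e-4):
    """Compute precision and recall for one query.

    hits: iterable of (accession, evalue) pairs from a similarity search.
    relevant: set of accessions treated as true homologs (hypothetical gold standard).
    """
    # Keep only hits that pass the E-value cutoff.
    reported = {acc for acc, evalue in hits if evalue < evalue_cutoff}
    true_pos = reported & relevant
    precision = len(true_pos) / len(reported) if reported else 0.0
    recall = len(true_pos) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: hit "D" fails the cutoff; relevant protein "E" was never found.
hits = [("A", 1e-30), ("B", 1e-8), ("C", 5e-5), ("D", 0.02)]
relevant = {"A", "B", "C", "E"}
precision, recall = precision_recall(hits, relevant)
```

In this toy case every reported hit is relevant, while one relevant protein is missed, so precision is higher than recall.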
The UniRef50-based searches also provided access to information from corresponding clusters, such as GO annotations from individual members, and enabled the detection of more remote similarities for the query sequence. The analysis supports the use of UniRef databases as a powerful alternative to native sequence databases for similarity searches and in functional annotation. The results also highlight new uses for UniRef clusters, such as the correction of GO term annotations through the detection of intra-cluster molecular function incoherencies. UniRef clusters for any UniProtKB entry can be viewed under the 'similar protein' section of every entry on the UniProt website.
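The cluster membership rule described in the summary (a sequence identity level plus an 80% length overlap with the longest sequence) can be illustrated with a small sketch. The `Alignment` fields and the `can_join_cluster` helper are hypothetical, and overlap is assumed here to mean the aligned length divided by the length of the longest sequence; the actual UniRef computation may define and apply these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    identity: float      # fraction of identical residues in the aligned region (0..1)
    aligned_length: int  # residues of the longest sequence covered by the alignment
    seed_length: int     # length of the longest sequence in the cluster


def can_join_cluster(aln, min_identity, min_overlap=0.80):
    """Return True if a candidate meets both the identity and overlap thresholds."""
    overlap = aln.aligned_length / aln.seed_length  # assumed definition of overlap
    return aln.identity >= min_identity and overlap >= min_overlap


# A near full-length match at moderate identity, and a short high-identity fragment.
full_length = Alignment(identity=0.62, aligned_length=450, seed_length=500)
fragment = Alignment(identity=0.95, aligned_length=150, seed_length=500)

joins_50 = can_join_cluster(full_length, min_identity=0.50)    # UniRef50-style level
joins_90 = can_join_cluster(full_length, min_identity=0.90)    # UniRef90-style level
fragment_joins = can_join_cluster(fragment, min_identity=0.50)
```

The fragment is rejected despite its high identity because it covers only 30% of the longest sequence, which is the situation the overlap threshold is meant to exclude.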

UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

2015 | Baris E. Suzek, Yuqi Wang, Hongzhan Huang, Peter B. McGarvey, Cathy H. Wu and the UniProt Consortium