The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since its last description, Pfam 33.1 has added over 350 new families and improved existing entries. To support research on SARS-CoV-2, Pfam has revised the entries covering the SARS-CoV-2 proteome and created new entries for regions that were previously not covered. Pfam-B, an automatically generated supplement, has been reintroduced and contains 136,730 novel clusters of sequences not matched by Pfam families. Pfam-B is built by clustering with the MMseqs2 software. Pfam has also compared RepeatsDB regions to Pfam and has started using the results to build and refine Pfam repeat families. Pfam is freely available at http://pfam.xfam.org/.

Pfam 33.1 contains 18,259 families and 635 clans. Since Pfam 32.0, 355 new families and 8 new clans have been added. Over 39% of Pfam families are in clans. 77.0% of UniProtKB sequences have at least one Pfam match, and 53.2% of residues fall within Pfam entries. These coverage figures have remained stable despite a 240% increase in UniProtKB size, because many new sequences match existing models. For UniProtKB reference proteomes, Pfam 33.1 has 75.1% sequence coverage and 49.4% residue coverage, slightly lower than the coverage of UniProtKB as a whole.

New families were added from Pfam-B, protein structures, and metagenomic clusters. Pfam has created many entries for domains of unknown function (DUF) and uncharacterized protein families (UPF). Over 1132 DUF or UPF families have been assigned functions. Pfam 33.1 contains 4244 DUF or UPF families, which is 23% of all Pfam families.

Pfam has updated many families, including those for SARS-CoV-2 proteins. New families were built for the SARS-CoV-2 proteome, including NSP6, and existing entries for non-structural proteins, including NSP3, NSP4, and NSP5, were updated. These changes improve the coverage of SARS-CoV-2 proteins, leaving only Orf10 unannotated.

The reintroduced Pfam-B contains 136,730 families, with an average of 99 sequences each. Pfam-B is released as a tar archive and is not integrated into the Pfam website.

Pfam type definitions have been updated to improve classification. Pfam has reclassified families with high disordered
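
The percentages quoted in the summary can be cross-checked from the raw counts it gives. The following is a minimal Python sketch that uses only figures stated above. The variable and function names are illustrative and not part of any Pfam tooling, and the Pfam-B sequence total is only a rough estimate because the average cluster size of 99 is itself a rounded figure.

```python
# Figures quoted in the summary for Pfam 33.1.
TOTAL_FAMILIES = 18_259
DUF_UPF_FAMILIES = 4_244
PFAM_B_CLUSTERS = 136_730
PFAM_B_MEAN_SIZE = 99  # rounded average, so any product is approximate

# Coverage percentages as (sequence, residue).
UNIPROTKB = (77.0, 53.2)
REFERENCE_PROTEOMES = (75.1, 49.4)


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def coverage_gap(full, reference):
    """Difference in percentage points, rounded to one decimal place."""
    return tuple(round(f - r, 1) for f, r in zip(full, reference))


# Share of families that are DUF or UPF entries.
duf_share = percent(DUF_UPF_FAMILIES, TOTAL_FAMILIES)

# How much lower coverage is on reference proteomes than on all of UniProtKB.
seq_gap, res_gap = coverage_gap(UNIPROTKB, REFERENCE_PROTEOMES)

# Rough number of sequences in Pfam-B (clusters times rounded mean size).
pfam_b_sequences = PFAM_B_CLUSTERS * PFAM_B_MEAN_SIZE

print(f"DUF/UPF share of families: {duf_share:.1f}%")
print(f"Sequence coverage gap: {seq_gap} points")
print(f"Residue coverage gap: {res_gap} points")
print(f"Approximate Pfam-B sequences: {pfam_b_sequences:,}")
```

The DUF/UPF share rounds to the 23% stated in the summary, which suggests the two counts and the percentage are mutually consistent.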

Pfam: The protein families database in 2021

2021 | Jaina Mistry, Sara Chuguransky, Lowri Williams, Matloob Qureshi, Gustavo A. Salazar, Erik L.L. Sonnhammer, Silvio C.E. Tosatto, Lisanna Paladin, Shriya Raj, Lorna J. Richardson, Robert D. Finn and Alex Bateman