Published online 30 October 2020 | Jaina Mistry, Sara Chuguransky, Lowri Williams, Matloob Qureshi, Gustavo A. Salazar, Erik L.L. Sonnhammer, Silvio C.E. Tosatto, Lisanna Paladin, Shriya Raj, Lorna J. Richardson, Robert D. Finn and Alex Bateman
The Pfam database, a widely used resource for classifying protein sequences into families and domains, has been updated to version 33.1, adding over 350 new families and numerous improvements to existing entries. To facilitate research on COVID-19, Pfam has revised the entries covering the SARS-CoV-2 proteome and built new entries for regions not previously covered. The release also reintroduces Pfam-B, an automatically generated supplement of 136,730 novel sequence clusters produced by clustering with the MMseqs2 software. Pfam has been compared with RepeatsDB, and the results are being used to refine Pfam repeat families. Pfam maintains a sequence coverage of approximately 77% and a residue coverage of 53% of UniProtKB, despite a significant increase in the size of UniProtKB. The article also describes updates to family building, which draws on Pfam-B clusters, metagenomic sequence clusters, PDB structures, and community submissions. Additionally, Pfam has revised its type definitions, particularly for families with predicted coiled-coil regions, and has identified families with a high proportion of disordered residues for reclassification. The article concludes with a discussion of ongoing efforts to improve Pfam's coverage and of the impact of the COVID-19 pandemic on research and model improvements.
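The sequence and residue coverage figures quoted above measure different things, and a small worked example may help to separate them. The sketch below assumes the usual reading of the two terms: sequence coverage is the fraction of sequences with at least one Pfam match, and residue coverage is the fraction of all residues that fall inside a match. The sequence names, lengths and match coordinates are invented toy data, and the helper functions `covered_residues` and `coverage` are hypothetical names introduced for illustration; this is not the Pfam production pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data (not real UniProtKB or Pfam records).
# Match regions are 1-based with inclusive start and end positions.
lengths: Dict[str, int] = {"seqA": 100, "seqB": 200, "seqC": 50, "seqD": 150}
matches: Dict[str, List[Tuple[int, int]]] = {
    "seqA": [(1, 60), (41, 80)],  # overlapping regions are counted once
    "seqB": [(11, 110)],
    "seqC": [(1, 50)],
    # seqD has no match at all
}


def covered_residues(regions: List[Tuple[int, int]]) -> int:
    """Count residues in the union of inclusive (start, end) intervals."""
    total = 0
    current_end = 0  # last position already counted
    for start, end in sorted(regions):
        start = max(start, current_end + 1)  # skip residues counted earlier
        if end >= start:
            total += end - start + 1
            current_end = end
    return total


def coverage(
    lengths: Dict[str, int], matches: Dict[str, List[Tuple[int, int]]]
) -> Tuple[float, float]:
    """Return (sequence coverage, residue coverage) as fractions."""
    matched_sequences = sum(1 for name in lengths if matches.get(name))
    matched_residues = sum(
        covered_residues(matches.get(name, [])) for name in lengths
    )
    return (
        matched_sequences / len(lengths),
        matched_residues / sum(lengths.values()),
    )


seq_cov, res_cov = coverage(lengths, matches)
print(f"sequence coverage: {seq_cov:.0%}, residue coverage: {res_cov:.0%}")
```

In this toy set three of the four sequences have a match, while only 230 of the 500 residues lie inside a match, so residue coverage is lower than sequence coverage, the same direction as the gap between the 77% and 53% figures reported for Pfam.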