2024 | Arian R. Jamasb*, Alex Morehead*, Chaitanya K. Joshi*, Zuobai Zhang*, Kieran Didi, Simon Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Liò, Tom L. Blundell
ProteinWorkshop is a comprehensive benchmark suite for evaluating representation learning on protein structures using Geometric Graph Neural Networks (GNNs). The study explores large-scale pre-training and downstream tasks on both experimental and predicted structures to systematically assess the quality of learned structural representations and their utility in capturing functional relationships. Key findings include that large-scale pre-training on AlphaFold structures and auxiliary tasks consistently improve performance of both rotation-invariant and equivariant GNNs, with equivariant models benefiting more from pre-training. The benchmark provides storage-efficient dataloaders for large-scale structural databases and utilities for constructing new tasks from the PDB. ProteinWorkshop includes a wide range of pre-training tasks, such as sequence and structure denoising, and downstream tasks like fold prediction, gene ontology prediction, and antibody developability prediction. The benchmark evaluates various GNN architectures and featurisation schemes, showing that incorporating more structural detail improves performance. The study highlights the importance of structural pre-training and auxiliary tasks in enhancing downstream tasks. The benchmark is open-source and available for use by the research community.
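The structure-denoising pre-training mentioned above can be sketched in a few lines: coordinates are corrupted with Gaussian noise, and a model is trained to recover the clean structure (or, equivalently, the added noise). This is a minimal NumPy illustration of the objective, not the ProteinWorkshop implementation; the function name and oracle "prediction" are purely illustrative.

```python
import numpy as np

def structure_denoising_example(coords, sigma=0.1, rng=None):
    """Corrupt Calpha coordinates with isotropic Gaussian noise.

    A denoising pre-training objective would train a GNN to predict
    the added noise (or the clean coordinates) from the noisy input.
    Here an oracle stands in for the model, so the loss is exactly zero.
    """
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(scale=sigma, size=coords.shape)
    noisy = coords + noise

    # Oracle prediction for illustration: a trained model would output
    # its estimate of `noise` given only `noisy`.
    predicted_noise = noisy - coords

    # Mean-squared-error denoising loss over all atoms and dimensions.
    mse = np.mean((predicted_noise - noise) ** 2)
    return noisy, mse
```

A real setup would replace the oracle with a GNN forward pass over the noisy structure graph; sequence denoising follows the same recipe with masked residue identities instead of perturbed coordinates.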