19 Jun 2024 | Arian R. Jamasb*,1,†, Alex Morehead*,2, Chaitanya K. Joshi*,1, Zuobai Zhang*,3, Kieran Didi†, Simon Mathis†, Charles Harris†, Jian Tang†, Jianlin Cheng†, Pietro Liò†, Tom L. Blundell†
The paper introduces *ProteinWorkshop*, a comprehensive benchmark suite for evaluating representation learning on protein structures using Geometric Graph Neural Networks (GNNs). The benchmark aims to systematically assess the quality of learned structural representations and their effectiveness in capturing functional relationships for downstream tasks. Key findings include:
1. **Large-scale Pretraining**: Large-scale pretraining on AlphaFold structures with auxiliary tasks consistently improves the performance of both rotation-invariant and equivariant GNNs.
2. **Expressiveness**: Equivariant GNNs benefit more from pretraining than invariant models do.
3. **Featurisation Schemes**: Featurisations combining Cα atoms, virtual angles, and backbone torsions provide the best overall performance (see the featurisation sketch after this list).
4. **Pretraining Datasets**: The benchmark includes large corpora of experimental and predicted structures, such as AlphaFoldDB and ESM Atlas, for pretraining.
5. **Downstream Tasks**: The benchmark evaluates a wide range of tasks, including inverse folding, protein-protein interaction site prediction, metal binding site prediction, and post-translational modification site prediction.
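To make the featurisation finding concrete, here is a minimal, self-contained sketch (not the ProteinWorkshop API) of a Cα-level featurisation: residues become nodes connected by a k-nearest-neighbour graph, and virtual torsion angles over consecutive Cα atoms become scalar node features. The coordinates, the choice of k, and the helper names are illustrative placeholders; a real pipeline would parse coordinates from PDB/mmCIF files.

```python
# Minimal sketch (not the ProteinWorkshop API) of a Cα-graph featurisation.
# Coordinates are random placeholders standing in for a parsed structure.
import torch

def knn_edges(coords: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Return a (2, N*k) edge index connecting each residue to its k nearest Cα neighbours."""
    dists = torch.cdist(coords, coords)            # (N, N) pairwise distances
    dists.fill_diagonal_(float("inf"))             # exclude self-loops
    nbrs = dists.topk(k, largest=False).indices    # (N, k) neighbour indices
    src = torch.arange(coords.shape[0]).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])

def virtual_torsions(coords: torch.Tensor) -> torch.Tensor:
    """Pseudo-dihedral angles defined by four consecutive Cα atoms."""
    b1 = coords[1:-2] - coords[:-3]
    b2 = coords[2:-1] - coords[1:-2]
    b3 = coords[3:] - coords[2:-1]
    n1, n2 = torch.cross(b1, b2, dim=-1), torch.cross(b2, b3, dim=-1)
    x = (n1 * n2).sum(-1)
    y = (torch.cross(n1, n2, dim=-1) * torch.nn.functional.normalize(b2, dim=-1)).sum(-1)
    return torch.atan2(y, x)                       # (N - 3,) angles in radians

n_residues = 64
ca = torch.randn(n_residues, 3)                    # placeholder Cα coordinates
edge_index = knn_edges(ca)
node_feats = torch.zeros(n_residues, 1)
node_feats[1:-2, 0] = virtual_torsions(ca)         # pad ends that lack a full angle window
print(edge_index.shape, node_feats.shape)
```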
The paper also discusses the design and implementation details of the benchmark, including the modular structure, featurisation schemes, pretraining tasks, and downstream tasks. The results show that incorporating more structural detail in input representations improves pretraining and downstream performance. The benchmark is open-source and available at [github.com/a-r-j/ProteinWorkshop](https://github.com/a-r-j/ProteinWorkshop), aiming to facilitate rigorous evaluation and advancement in protein structure representation learning.
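As an illustration of the denoising-style auxiliary pretraining tasks described above, the following hedged sketch implements a generic coordinate-denoising objective: Gaussian noise is added to Cα coordinates and a model is trained to regress the added noise. The `NoisePredictor` encoder here is a stand-in MLP rather than any of the benchmarked geometric GNNs, and all names and hyperparameters are hypothetical.

```python
# Hedged sketch of a coordinate-denoising pretraining objective; an
# illustration only, not the ProteinWorkshop training loop.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Stand-in encoder: maps per-residue coordinates to predicted noise vectors."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, noisy_coords: torch.Tensor) -> torch.Tensor:
        return self.mlp(noisy_coords)              # (N, 3) predicted noise per residue

def denoising_loss(model: nn.Module, coords: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """Corrupt Cα coordinates with Gaussian noise and regress the noise."""
    noise = sigma * torch.randn_like(coords)
    pred = model(coords + noise)
    return nn.functional.mse_loss(pred, noise)

model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ca = torch.randn(64, 3)                            # placeholder structure; real data would come from e.g. AlphaFoldDB
for step in range(10):
    opt.zero_grad()
    loss = denoising_loss(model, ca)
    loss.backward()
    opt.step()
print(float(loss))
```

In the benchmark itself, such auxiliary objectives are combined with the featurisation schemes and geometric encoders discussed above through its modular design.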