| J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer, F. Chan, B. Koepnick, H. Nguyen, A. Kang, B. Sankaran, A. K. Bera, N. P. King, D. Baker
This supplementary material provides details on the training and architecture of ProteinMPNN, a deep learning model for protein sequence design.

The model was trained on proteins from the CATH 4.2 40% non-redundant set, with architectural modifications for the single-chain and multi-chain experiments. For single-chain models, additional edge features were introduced, including Gaussian radial basis functions of inter-residue distances. For multi-chain models, the training data comprised high-resolution protein assemblies from the PDB with a bounded residue count, and sequences were clustered into training, validation, and test sets. The loss function was the negative log likelihood with label smoothing, and the model was optimized with Adam under a learning rate schedule. The architecture consists of encoder-decoder message-passing networks with multiple layers and hidden dimensions; input features comprise edge distances and relative positional encodings, with no node features.

The model was benchmarked with AlphaFold, showing improved sequence recovery and structure prediction. Additional experiments included analysis of amino acid compositional bias, comparison with Rosetta, and evaluation of performance across backbone noise levels and sampling temperatures. The model was also tested with Cα-only input features, showing comparable performance.

Experimental methods included protein expression, purification, and crystallization, with crystallographic data deposited to the PDB. Figures show sequence recovery, AlphaFold success rates, and comparisons with other models. The results demonstrate the effectiveness of ProteinMPNN in generating protein sequences with high accuracy and structural fidelity.
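To illustrate the edge featurization mentioned above, here is a minimal sketch of expanding inter-residue distances into Gaussian radial basis functions. The distance range, number of bins, and shared width below are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def rbf_features(distances, d_min=2.0, d_max=22.0, n_bins=16):
    """Expand pairwise distances (in angstroms) into Gaussian RBFs.

    d_min, d_max, and n_bins are illustrative choices, not the
    model's actual settings.
    """
    centers = np.linspace(d_min, d_max, n_bins)       # (n_bins,)
    sigma = (d_max - d_min) / n_bins                  # shared bin width
    d = np.asarray(distances)[..., None]              # (..., 1)
    # Each distance activates the bins whose centers it is close to.
    return np.exp(-(((d - centers) / sigma) ** 2))    # (..., n_bins)

# Example: distances between three residue pairs
feats = rbf_features(np.array([3.8, 6.5, 12.0]))
print(feats.shape)  # (3, 16)
```

This smooth binning lets the message-passing layers reason about distance ranges rather than raw scalar values.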
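The training objective described above (negative log likelihood with label smoothing) can be sketched as follows; the smoothing weight and the 21-letter alphabet (20 amino acids plus an unknown token) are assumptions for illustration.

```python
import numpy as np

def smoothed_nll(logits, targets, n_classes=21, smoothing=0.1):
    """Negative log likelihood with label smoothing.

    The smoothing value and 21-class alphabet are illustrative
    assumptions, not the paper's exact settings.
    """
    # Numerically stable log-softmax over the amino-acid dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    # Smoothed targets: mostly one-hot, with a little mass spread
    # uniformly over all classes.
    one_hot = np.eye(n_classes)[targets]
    smooth = (1.0 - smoothing) * one_hot + smoothing / n_classes
    return -(smooth * log_probs).sum(axis=-1).mean()

rng = np.random.default_rng(0)
loss = smoothed_nll(rng.normal(size=(5, 21)), np.array([0, 3, 7, 19, 20]))
```

Label smoothing discourages the decoder from assigning all probability to a single residue, which is useful when several amino acids are structurally plausible at a position.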
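The sampling-temperature evaluation mentioned above can be illustrated with a small sketch: lower temperatures concentrate probability on the highest-scoring amino acids, while higher temperatures increase diversity. The temperature value and four-class logits below are illustrative only.

```python
import numpy as np

def sample_amino_acid(logits, temperature=0.1, rng=None):
    """Sample an amino-acid index from logits at a given temperature.

    Temperature and logits here are illustrative; low temperatures
    approach greedy (argmax) decoding.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()                   # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(1)
logits = np.array([2.0, 0.5, -1.0, 0.0])
# At T = 0.1 almost every draw picks the top-scoring index.
picks = [sample_amino_acid(logits, temperature=0.1, rng=rng) for _ in range(100)]
```

Sweeping the temperature is a simple way to trade off sequence recovery against sequence diversity for a fixed backbone.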