02 July 2024 | Ziyi Zhou, Liang Zhang, Yuanxi Yu, Banghao Wu, Mingchen Li, Liang Hong & Pan Tan
This study introduces FSFP, a training strategy that enhances the performance of protein language models (PLMs) using minimal wet-lab data through few-shot learning. FSFP combines meta-transfer learning, learning to rank, and parameter-efficient fine-tuning to significantly improve the accuracy of various PLMs using only tens of labeled single-site mutants. In-silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. Furthermore, FSFP is successfully applied to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results highlight the potential of FSFP in aiding AI-guided protein engineering.
Proteins are essential for biological activities, and their biocatalytic properties are driving growing use in scientific research and industrial production. However, most wild-type proteins are not suited to industrial conditions. Protein engineering aims to improve their properties, but traditional methods struggle to screen mutant libraries and often lack the detailed structural knowledge needed to guide design. Deep learning has shown potential for uncovering the relationship between protein sequence and function, which helps in exploring the design space.
Deep learning approaches fall into supervised and unsupervised models; the main difference is whether the training data requires manual labels. Pre-trained PLMs such as ESM-2, ProGen, SaProt, and ProtT5 are currently among the most popular unsupervised approaches to fitness prediction. These models can estimate probability distributions over protein sequences without experimental data, but their accuracy is limited. Supervised deep learning models, by contrast, predict protein fitness with high accuracy but require extensive data from expensive mutagenesis experiments.
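To make the unsupervised setting concrete, the sketch below shows one common way to turn a PLM's per-position probabilities into a mutation score: the log-odds of the mutant residue against the wild-type residue. This is an illustration only; the summary does not state which scoring rule the PLMs in this study use. The sequence `WILD_TYPE`, the table `POSITION_PROBS`, and the function `zero_shot_score` are hypothetical names with invented values, standing in for real model output.

```python
import math

# Hypothetical wild-type sequence and per-position amino-acid probabilities.
# In practice these would come from a pre-trained PLM; the values are invented.
WILD_TYPE = "MKV"
POSITION_PROBS = [
    {"M": 0.90, "K": 0.05, "V": 0.05},
    {"M": 0.10, "K": 0.60, "V": 0.30},
    {"M": 0.20, "K": 0.30, "V": 0.50},
]


def zero_shot_score(mutant: str) -> float:
    """Score a single-site mutant such as 'K2V' as log p(mutant) - log p(wild type)."""
    wt_aa, pos, mut_aa = mutant[0], int(mutant[1:-1]), mutant[-1]
    if WILD_TYPE[pos - 1] != wt_aa:
        raise ValueError(f"{mutant}: wild-type residue does not match the sequence")
    probs = POSITION_PROBS[pos - 1]
    return math.log(probs[mut_aa]) - math.log(probs[wt_aa])


# No experimental labels are needed: the model's own probabilities rank the mutants.
scores = {m: zero_shot_score(m) for m in ["K2V", "K2M", "V3K"]}
ranked = sorted(scores, key=scores.get, reverse=True)
```

A higher score means the model considers the substitution more plausible than the wild-type residue at that position. Because the score depends only on the pre-trained model, its accuracy is bounded by how well sequence likelihood tracks the measured property.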
FSFP leverages meta-learning to train PLMs more effectively when labels are scarce. It uses model-agnostic meta-learning (MAML), a gradient-based meta-learning method, to meta-train PLMs on a set of constructed tasks, and low-rank adaptation (LoRA) to reduce the risk of overfitting on small training datasets. FSFP treats fitness prediction as a ranking problem and applies learning to rank (LTR) in both meta-training and the subsequent transfer learning. Its performance is evaluated on the substitution benchmark of ProteinGym, where it proves robust across different PLMs and proteins.
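Treating fitness prediction as a ranking problem means the model is penalized for ordering mutants incorrectly rather than for missing their exact fitness values. The sketch below implements ListMLE, one well-known listwise ranking loss, in plain Python. It is an assumption chosen for illustration: the summary above does not say which LTR loss FSFP uses, and the function name `listmle_loss` and all example values are hypothetical.

```python
import math


def listmle_loss(pred_scores, true_fitness):
    """Negative log-likelihood of the true ranking under a Plackett-Luce model.

    Items are sorted by measured fitness (best first). At each rank, the loss
    is the log-sum-exp of the predicted scores of all items not yet placed,
    minus the predicted score of the item that truly belongs at that rank.
    """
    order = sorted(range(len(true_fitness)),
                   key=lambda i: true_fitness[i], reverse=True)
    s = [pred_scores[i] for i in order]
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]
        m = max(tail)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in tail))
        loss += log_sum_exp - s[i]
    return loss


# Invented measurements for three mutants, plus two sets of predictions.
fitness = [0.1, 0.9, 0.5]
good_pred = [-1.0, 2.0, 0.5]  # same ordering as the measurements
bad_pred = [2.0, -1.0, 0.5]   # reversed ordering

good_loss = listmle_loss(good_pred, fitness)
bad_loss = listmle_loss(bad_pred, fitness)
```

Here `good_loss` is smaller than `bad_loss` because only the ordering matters. Adding a constant to every prediction leaves the loss unchanged, which suits few-shot settings where predicted scores need not match the scale of the assay.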
FSFP is applied to three representative PLMs (ESM-1v, ESM-2, and SaProt), demonstrating its effectiveness in few-shot learning. FSFP outperforms other baselines, including ridge regression, at every training data size. FSFP also shows robust generalizability and extrapolation ability, performing well on mutations whose positions are absent from the training data. FSFP is applied to engineer Phi29 DNA polymerase, achieving a 25% increase in the positive rate. These results affirm the potential of FSFP in accelerating the iterative cycle of design and testing in protein engineering.
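The extrapolation setting mentioned above, in which test mutations fall at positions never seen during training, can be set up by partitioning a test set on residue position. The following is a minimal sketch under stated assumptions: mutants are written in the usual `A23G` style (wild-type residue, 1-based position, mutant residue), and the helper names and mutant labels are hypothetical rather than taken from the study.

```python
def mutation_position(mutant: str) -> int:
    """Return the 1-based residue position from a label such as 'A23G'."""
    return int(mutant[1:-1])


def split_by_seen_positions(train_mutants, test_mutants):
    """Partition test mutants by whether their position occurs in the training set."""
    seen = {mutation_position(m) for m in train_mutants}
    seen_positions = [m for m in test_mutants if mutation_position(m) in seen]
    unseen_positions = [m for m in test_mutants if mutation_position(m) not in seen]
    return seen_positions, unseen_positions


# Invented single-site mutants for illustration.
train = ["A23G", "L45P", "K7R"]
test = ["A23V", "D102E", "K7E", "S88T"]

seen_pos, unseen_pos = split_by_seen_positions(train, test)
```

Scoring a model separately on `unseen_pos` shows whether it has learned something transferable about the protein rather than memorizing effects at the few positions it was trained on.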