02 July 2024 | Ziyi Zhou, Liang Zhang, Yuanxi Yu, Banghao Wu, Mingchen Li, Liang Hong, Pan Tan
This paper introduces FSFP (Few-Shot Learning for Protein Fitness Prediction), a training strategy that improves the performance of protein language models (PLMs) under extreme data scarcity. FSFP combines meta-transfer learning, learning to rank, and parameter-efficient fine-tuning to significantly boost the accuracy of various PLMs using only tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. The approach is further validated through wet-lab experiments on Phi29 DNA polymerase, achieving a 25% increase in the positive rate. These results highlight the potential of FSFP in aiding AI-guided protein engineering.

The method uses meta-learning to train PLMs on limited data. Auxiliary tasks are built from a combination of existing labeled mutant datasets and pseudo labels generated from multiple sequence alignment (MSA). Meta-training with model-agnostic meta-learning (MAML) optimizes the initial model parameters, while low-rank adaptation (LoRA) is used to prevent overfitting. The meta-trained model is then fine-tuned on the target few-shot learning task, treating fitness prediction as a ranking problem with a listwise ranking loss (ListMLE). FSFP demonstrates robust generalizability and extrapolation ability, outperforming other methods in predicting both single-site and multi-site mutants. Its practical efficacy is demonstrated by its application to engineering Phi29 DNA polymerase, where it yields significant improvements in both average melting temperature (Tm) and positive rate.
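To make the ranking formulation concrete, the following is a minimal sketch of a ListMLE-style loss in plain Python. It is not the FSFP implementation: the function name `list_mle_loss` and the toy scores and labels are assumptions for illustration, and in the actual method the scores would come from a PLM rather than a hand-written list. The loss is the negative log-likelihood, under a Plackett-Luce model, of the ordering obtained by sorting mutants from highest to lowest measured fitness.

```python
import math


def list_mle_loss(scores, labels):
    """ListMLE-style loss: negative Plackett-Luce log-likelihood of the
    ordering induced by sorting items by label, highest first.

    scores: model-predicted scores, one per mutant (hypothetical inputs).
    labels: measured fitness values, one per mutant.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    # Ground-truth ordering: indices sorted by measured fitness, descending.
    order = sorted(range(len(labels)), key=lambda i: labels[i], reverse=True)
    ranked = [scores[i] for i in order]

    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        # Numerically stable log-sum-exp over the items not yet placed.
        m = max(tail)
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        # -log P(item i is chosen first among the remaining items).
        loss += log_norm - ranked[i]
    return loss


# Toy example: three mutants with measured fitness values.
labels = [1.2, 0.3, -0.5]
loss_agreeing = list_mle_loss([2.0, 1.0, 0.0], labels)   # scores match the true order
loss_reversed = list_mle_loss([0.0, 1.0, 2.0], labels)   # scores invert the true order
```

Scores that agree with the measured ordering give a lower loss than scores that reverse it, so minimizing this quantity rewards correct relative ranking of mutants rather than accurate absolute fitness values, which is the motivation for treating few-shot fitness prediction as a ranking problem.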