Protein language models are biased by unequal sequence sampling across the tree of life

March 12, 2024 | Frances Ding, Jacob Steinhardt
Protein language models (pLMs) trained on large protein sequence databases are used to understand disease and design novel proteins. These models, however, encode a species bias: sequences from certain species receive systematically higher likelihoods than sequences from other species, independent of the specific protein. The bias arises from unequal species representation in popular protein sequence databases. The study quantifies the bias and shows that it can be detrimental in protein design applications such as enhancing thermostability. It attributes the bias to the imbalance in species representation, together with the evolutionary relationships among species in the training data. The findings highlight the importance of understanding and curating pLM training data to mitigate biases and improve protein design in under-explored regions of sequence space. The results suggest that protein designers should account for species bias when using pLM likelihoods and consider tailoring training data to specific applications.
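To make the phrase "independent of the specific protein" concrete, here is a minimal sketch of one way to look for a species-level offset in likelihood scores. It is an illustration only, not the study's procedure: the function name `species_offsets`, the toy scores, and the choice to center each score by its protein's mean are all assumptions. It also assumes every protein has been scored in every species, since otherwise the per-protein mean is itself skewed by which species are present.

```python
from collections import defaultdict
from statistics import mean


def species_offsets(scores):
    """Average per-species score after removing each protein's mean.

    `scores` is an iterable of (protein, species, log_likelihood) tuples.
    Returns a dict mapping species to its mean protein-centered score.
    A consistently positive value means that species tends to score above
    the per-protein average, whichever protein is considered.
    """
    scores = list(scores)

    # Mean log-likelihood of each protein across all species.
    by_protein = defaultdict(list)
    for protein, _species, ll in scores:
        by_protein[protein].append(ll)
    protein_mean = {p: mean(v) for p, v in by_protein.items()}

    # Center each score by its protein mean, then average per species.
    by_species = defaultdict(list)
    for protein, species, ll in scores:
        by_species[species].append(ll - protein_mean[protein])
    return {s: mean(v) for s, v in by_species.items()}


# Hypothetical toy scores (not real model outputs).
toy_scores = [
    ("protein_A", "species_1", -1.0),
    ("protein_A", "species_2", -3.0),
    ("protein_B", "species_1", -2.0),
    ("protein_B", "species_2", -4.0),
]
offsets = species_offsets(toy_scores)
```

In this toy input, `species_1` scores above the per-protein average for both proteins, which is the kind of protein-independent pattern the summary describes. Real pLM scores would need a more careful treatment of missing protein-species pairs and of sequence length.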