Conditional language models enable the efficient design of proficient enzymes

May 5, 2024 | Geraldene Munsamy, Ramiro Illanes-Vicioso, Silvia Funcillo, Ioanna T. Nakou, Sebastian Lindner, Gavin Ayres, Lesley S. Sheehan, Steven Moss, Ulrich Eckhard, Philipp Lorenz, Noelia Ferruz
The paper introduces ZymCTRL, a conditional language model that generates catalytically active artificial enzymes from user-defined specifications. ZymCTRL was trained on the BRENDA database, which contains 37 million enzyme sequences classified by Enzyme Commission (EC) numbers. The model generates highly active, well-folded enzymes even at sequence identities below 40% to natural proteins. Experimental validation on carbonic anhydrases and lactate dehydrogenases demonstrated the model's effectiveness. Seven of ten beta carbonic anhydrases generated zero-shot showed activity, and two were close to wild-type levels. For lactate dehydrogenases, fine-tuning the model on diverse metagenomic sequences improved the likelihood of generating sequences that pass in silico quality metrics, and the resulting designs exhibited activity comparable to their natural counterparts. Two of the generated lactate dehydrogenases were scaled up, lyophilized, and successfully integrated into enzymatic cascades under extreme conditions. The findings highlight the potential of conditional language models for the rapid, cost-effective design of proficient artificial enzymes. The model and training data are freely available to the community.
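To make the "below 40% sequence identity" figure concrete, here is a minimal sketch of how the percent identity between a generated sequence and a natural one could be computed. It is not the paper's pipeline. It assumes a plain Needleman-Wunsch global alignment with unit scores (match +1, mismatch -1, gap -1) and defines identity as identical aligned positions divided by alignment length, gaps included. The function names and the example sequences are hypothetical, and dedicated alignment tools use substitution matrices and other identity definitions that can give different numbers.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identical aligned positions as a percentage of alignment length (assumed definition)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    gapped_a, gapped_b = global_align(a, b)
    matches = sum(1 for x, y in zip(gapped_a, gapped_b) if x == y and x != "-")
    return 100.0 * matches / len(gapped_a)


# Hypothetical toy sequences, far shorter than real enzymes.
generated = "MKTAYIAKQR"
natural = "MKSAYLAKHR"
is_distant = percent_identity(generated, natural) < 40.0
```

In practice a generated enzyme would be compared against its closest natural match in a reference database, so a check like `is_distant` would be run against the best hit rather than a single arbitrary sequence.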