Conditional language models enable the efficient design of proficient enzymes

May 5, 2024 | Geraldene Munsamy, Ramiro Illanes-Vicioso, Silvia Funcillo, Ioanna T. Nakou, Sebastian Lindner, Gavin Ayres, Lesley S. Sheehan, Steven Moss, Ulrich Eckhard, Philipp Lorenz, Noelia Ferruz
ZymCTRL, a conditional language model trained on the enzyme sequence space, can generate enzymes from user-defined specifications. Experimental validation across diverse data regimes and enzyme families demonstrated ZymCTRL's ability to generate active enzymes across a range of sequence identities. Specifically, carbonic anhydrases and lactate dehydrogenases were designed in a zero-shot setting, without further training, and showed activity at sequence identities below 40% relative to natural proteins. Biophysical analysis confirmed that the generated sequences are globular and well folded. Fine-tuning yielded lactate dehydrogenases that were more likely to pass in silico filters and that showed activity comparable to natural counterparts. Two artificial lactate dehydrogenases were scaled up and successfully lyophilized, maintaining activity and showing preliminary conversion in one-pot enzymatic cascades under extreme conditions. ZymCTRL represents a timely advance toward conditional, cost-effective enzyme design; the model and training data are freely available to the community.

ZymCTRL was trained on the BRENDA database, comprising 37M enzyme sequences classified by Enzyme Commission (EC) number. Each sequence was paired with its EC class during training, enabling the model to learn sequence features specific to each catalytic reaction. To address representation bias, EC classes were tokenized, allowing the model to transfer learned insights across catalytic reactions.

The model was tested under different scenarios, generating sequences with varying identity to the natural space. The generated sequences were novel yet predicted to be ordered and functional, with high pLDDT values indicating potential for catalytic activity. Among the carbonic anhydrase designs, 20 generated sequences showed activity, two of them approaching the activity of natural enzymes. Among the lactate dehydrogenase designs, 20 generated sequences likewise showed activity, and some retained activity after lyophilization and integration into enzymatic cascades.

ZymCTRL's embedding space distinguishes enzyme functional classes, with sequences from different classes occupying distinct regions. The model's ability to generate active enzymes both zero-shot and after fine-tuning highlights its potential for efficient enzyme design. These findings demonstrate the promise of conditional language models for designing proficient enzymes.
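The summary describes conditioning generation on EC numbers, with each sequence paired to a tokenized EC class during training. A minimal sketch of that prompting idea is shown below; the `<sep>` separator token and the exact prompt layout are assumptions for illustration (the released ZymCTRL tokenizer defines the real control tokens), while the EC numbers themselves follow the standard four-level dotted format.

```python
import re

# A well-formed EC number has four dot-separated numeric levels,
# e.g. "4.2.1.1" (carbonic anhydrase) or "1.1.1.27" (L-lactate dehydrogenase).
EC_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.\d+$")

def make_conditioning_prompt(ec_number: str, sep_token: str = "<sep>") -> str:
    """Build a control prompt that conditions generation on an EC class.

    The separator token here is hypothetical; consult the released
    ZymCTRL tokenizer for the actual special tokens.
    """
    if not EC_PATTERN.match(ec_number):
        raise ValueError(f"not a valid EC number: {ec_number!r}")
    return f"{ec_number}{sep_token}"

print(make_conditioning_prompt("4.2.1.1"))   # carbonic anhydrase class
print(make_conditioning_prompt("1.1.1.27"))  # lactate dehydrogenase class
```

In a real pipeline, such a prompt would be tokenized and passed to the model's sampling routine, which then completes it with an amino-acid sequence belonging to the requested catalytic class.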
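The reported "sequence identities below 40%" are pairwise identities against natural proteins. As an illustration, here is one common way to compute percent identity from a pre-computed alignment; note that several conventions exist (dividing by alignment length, by the shorter sequence, or, as assumed here, by the number of gap-free columns), so reported values depend on the choice.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Convention assumed here: identical residue pairs divided by the
    number of columns where neither sequence has a gap ('-').
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = columns = 0
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns entirely
        columns += 1
        if a == b:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0

# Toy example: 6 gap-free columns, 5 identical residues
print(percent_identity("MKV-LIS", "MKVALIT"))
```

Under this convention, a designed enzyme at "below 40% identity" shares fewer than four in ten aligned residues with its closest natural relative, yet, per the summary, can still fold and catalyze the target reaction.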