10 Jun 2024 | Juan Manuel Zambrano Chaves*, Eric Wang*, Tao Tu, Eeshit Dhaval Vaishnav, Byron Lee, S. Sara Mahdavi, Christopher Semturs, David Fleet, Vivek Natarajan†, Shekoofeh Azizi†
Tx-LLM is a large language model (LLM) fine-tuned from PaLM-2 to encode knowledge about diverse therapeutic modalities. It is trained on 709 datasets spanning 66 tasks across the drug discovery pipeline, including predicting drug properties, toxicity, and drug synergy. The training data cover molecular, protein, nucleic acid, and text-based features, and the model is evaluated on binary classification, regression, and generation tasks.

Tx-LLM performs competitively with state-of-the-art (SOTA) models on 43 of the 66 tasks and exceeds SOTA on 22. It is particularly effective on tasks that combine molecular SMILES representations with text, likely because text is a natural input representation for LLMs and because of context learned during pretraining. The model also shows positive transfer between tasks involving diverse drug types, such as small molecules and proteins, suggesting that a single generalist LLM can be effective for tasks involving both drugs and targets. Performance is influenced by model size, domain fine-tuning, and prompting strategies.

One limitation is that Tx-LLM is not instruction-tuned to follow natural language, which limits its ability to explain its predictions.
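To make the SMILES-plus-text setup concrete, here is a minimal sketch of how such a prompt might be assembled for a binary classification task. The summary does not reproduce Tx-LLM's actual prompt templates or inference API, so the format, the helper names (`build_prompt`, `parse_binary_answer`, `ask_model`), and the example question below are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: Tx-LLM's real prompt templates are not given in
# this summary, so this format and these helper names are assumptions.

ASPIRIN_SMILES = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, a small-molecule drug

def build_prompt(smiles: str, question: str) -> str:
    """Combine a molecular SMILES string with free text, the kind of
    mixed input on which Tx-LLM is reported to perform best."""
    return (
        "Instructions: Answer the following question about a drug molecule.\n"
        f"Drug SMILES: {smiles}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def parse_binary_answer(completion: str) -> bool:
    """Map a free-text model completion onto a binary classification label."""
    return completion.strip().lower().startswith("yes")

prompt = build_prompt(
    ASPIRIN_SMILES,
    "Does this molecule penetrate the blood-brain barrier?",
)
print(prompt)
# The prompt would then be sent to the fine-tuned model, e.g.:
#   label = parse_binary_answer(ask_model(prompt))
```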
Trained across such a wide variety of tasks, Tx-LLM represents an important step towards LLMs that encode biochemical knowledge, and it could have a future role as an end-to-end tool for therapeutic development. It is not yet effective on every task, however, and further improvements are needed. Even so, Tx-LLM is a promising step towards using AI to enhance therapeutic development.