20 Feb 2024 | Haisong Gong, Qiang Liu, Shu Wu, Liang Wang
This paper introduces TGM-DLM, a novel diffusion language model for text-guided molecule generation. TGM-DLM addresses the limitations of autoregressive methods with a two-phase diffusion generation process. In the first phase, embeddings are optimized starting from random noise, guided by the text description. In the second phase, invalid SMILES strings are corrected so that they form valid molecular representations. TGM-DLM outperforms MolT5-Base, an autoregressive model, without requiring additional data resources. The model generates coherent and precise molecules with specific properties, which opens new avenues in drug discovery and related scientific domains. The code is available at https://github.com/Deno-V/tgm-dlm.
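To make concrete what an "invalid SMILES string" can look like, here is a minimal sketch of a purely syntactic check for two common problems: unmatched parentheses and ring-closure labels that are opened but never closed. This is an illustration only, not the paper's correction phase; the function name `smiles_syntax_issues` is hypothetical, and the choice of these two error types is an assumption made here. A string that passes this check can still be chemically invalid.

```python
from typing import List


def smiles_syntax_issues(smiles: str) -> List[str]:
    """Report unmatched parentheses and unclosed ring labels (illustrative only)."""
    issues: List[str] = []
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels seen an odd number of times
    in_bracket = False   # inside a [...] atom, digits are not ring labels
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            pass  # isotopes, hydrogen counts, and charges live here
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                issues.append("unmatched ')'")
            else:
                depth -= 1
        elif ch == "%":
            # Two-digit ring label such as %12
            label = smiles[i + 1:i + 3]
            open_rings ^= {label}  # toggle: first sighting opens, second closes
            i += 2
        elif ch.isdigit():
            open_rings ^= {ch}
        i += 1
    if depth > 0:
        issues.append(f"{depth} unclosed '('")
    if open_rings:
        issues.append("unclosed ring labels: " + ", ".join(sorted(open_rings)))
    return issues


if __name__ == "__main__":
    for s in ["c1ccccc1", "C1CCCCC", "CC(C", "CC)C"]:
        print(s, smiles_syntax_issues(s))
```

In practice, a cheminformatics toolkit's parser would be the more reliable validity test; the sketch above only shows the kind of structural error a correction step has to deal with.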
The paper also discusses related work, including SMILES-based molecule generation and diffusion models for language generation. The method is evaluated on the ChEBI-20 dataset, where it performs better on several metrics, including the exact match score and fingerprint-based similarity metrics. TGM-DLM achieves these improvements without additional data or pre-training, which highlights its effectiveness in text-guided molecule generation.
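As a rough illustration of the two kinds of metric mentioned above, the sketch below computes an exact match rate over paired SMILES strings and a Tanimoto similarity between two fingerprints represented as sets of "on" bit indices. The function names are hypothetical, the inputs are assumed to be already canonicalized, and the summary does not state which fingerprints or similarity measure the paper uses, so treating the fingerprint metric as Tanimoto similarity is an assumption made here.

```python
from typing import Sequence, Set


def exact_match_rate(generated: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of generated strings identical to their reference.

    Assumes both lists are aligned and already canonicalized, so string
    equality stands in for molecule identity.
    """
    if len(generated) != len(reference):
        raise ValueError("generated and reference must have the same length")
    if not generated:
        raise ValueError("at least one pair is required")
    matches = sum(g == r for g, r in zip(generated, reference))
    return matches / len(generated)


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity: |A intersect B| / |A union B| over on-bit index sets."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # choice made for this sketch when both fingerprints are empty
    return len(fp_a & fp_b) / len(union)


if __name__ == "__main__":
    print(exact_match_rate(["CCO", "CC"], ["CCO", "CCC"]))
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

A real evaluation would derive the fingerprints from parsed molecules with a cheminformatics library rather than from hand-written sets.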