3D-MoLM: Towards 3D Molecule-Text Interpretation in Language Models

2024 | Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, Qi Tian
This paper introduces 3D-MoLM, a framework for 3D molecule-text interpretation. 3D-MoLM enables a language model (LM) to interpret and analyze 3D molecular structures by integrating a 3D molecular encoder. The integration is achieved through a 3D molecule-text projector, which bridges the representation space of the 3D molecular encoder and the input space of the LM. In addition, a 3D molecule-centric instruction tuning dataset, 3D-MoIT, is curated to cultivate the model's ability to follow human instructions and to understand 3D-dependent molecular properties.

The 3D-MoLM framework consists of three key components: a 3D molecular encoder, a 3D molecule-text projector, and an LM. The 3D molecular encoder is based on Uni-Mol, which is pretrained on a large molecule dataset. The projector, inspired by vision-language models, maps the encoder's representations into the LM's input space, and the LM is then adapted to reason over 3D molecular structures.

To address the challenges of 3D molecule-text alignment and instruction tuning, a three-stage training pipeline is proposed: the first stage performs 3D molecule-text representation learning, the second aligns 3D molecules with text via generative learning, and the third carries out 3D molecule-centric instruction tuning. Instruction tuning uses 3D-MoIT, which draws on data from PubChem and PubChemQC, transformed into an instruction-following format.

Extensive experiments show that 3D-MoLM excels at molecule-text retrieval, molecule captioning, and open-text molecular question answering (QA), significantly surpassing existing baselines, especially on 3D-dependent properties. On the PubChem dataset it outperforms baselines in molecule-text retrieval and molecule captioning, and it achieves state-of-the-art results in open-text molecular QA, particularly when predicting properties that are intrinsically determined by 3D conformations. The contributions of this work are the proposal of 3D-MoLM, the curation of 3D-MoIT, and the demonstration of 3D-MoLM's effectiveness across downstream tasks, highlighting the importance of 3D molecular representation learning for cross-modal molecular understanding.
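The projector's role can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions (`ENC_DIM`, `N_QUERIES`, `LM_DIM`), the single-head attention pooling, and the function names are all illustrative assumptions, standing in for a query-based cross-attention projector of the kind used in vision-language models.

```python
import numpy as np

# Hypothetical dimensions (not taken from the paper): the 3D encoder's
# per-atom feature size, the number of learned query tokens, and the
# LM's embedding size.
ENC_DIM, N_QUERIES, LM_DIM = 512, 8, 4096

rng = np.random.default_rng(0)

def project_molecule(atom_feats: np.ndarray,
                     queries: np.ndarray,
                     w_out: np.ndarray) -> np.ndarray:
    """Cross-attend learned queries over per-atom 3D features, then
    linearly map the pooled queries into the LM's embedding space."""
    # scaled dot-product attention scores: (n_queries, n_atoms)
    scores = queries @ atom_feats.T / np.sqrt(atom_feats.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    pooled = weights @ atom_feats            # (n_queries, enc_dim)
    return pooled @ w_out                    # (n_queries, lm_dim)

# Toy molecule with 20 atoms, each described by an encoder feature vector.
atom_feats = rng.normal(size=(20, ENC_DIM))
queries = rng.normal(size=(N_QUERIES, ENC_DIM))
w_out = rng.normal(size=(ENC_DIM, LM_DIM)) * 0.02

mol_tokens = project_molecule(atom_feats, queries, w_out)
print(mol_tokens.shape)  # (8, 4096)
```

The output is a fixed number of "soft tokens" in the LM's embedding space, which can be prepended to the text prompt regardless of how many atoms the molecule has.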
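The three-stage pipeline can also be sketched as a training schedule. The module names, data descriptions, and objectives below are paraphrased assumptions for illustration, not exact training code from the paper.

```python
# Schematic three-stage schedule: which modules are trained at each
# stage, on what data, and toward what objective (all hypothetical
# labels summarizing the pipeline described in the text).
STAGES = [
    ("stage1_representation", {"train": ["projector"],
                               "data": "3D molecule-text pairs",
                               "objective": "representation alignment"}),
    ("stage2_generative",     {"train": ["projector", "lm_adapter"],
                               "data": "molecule-caption pairs",
                               "objective": "generative alignment"}),
    ("stage3_instruction",    {"train": ["projector", "lm_adapter"],
                               "data": "3D-MoIT instructions",
                               "objective": "instruction following"}),
]

def run_schedule(stages):
    """Iterate the stages in order; return the set of modules that
    were trained at some point in the schedule."""
    trained = set()
    for name, cfg in stages:
        trained.update(cfg["train"])
        print(f"{name}: train {cfg['train']} on {cfg['data']} "
              f"({cfg['objective']})")
    return trained

modules = run_schedule(STAGES)
```

The key design point the schedule captures is progressive unfreezing: only the lightweight projector is tuned first, and the LM is adapted later, once molecular representations are already aligned with text.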