Understanding Towards 3D Molecule-Text Interpretation in Language Models

The paper introduces 3D-MoLM, a framework for 3D molecule-text interpretation that enables language models (LMs) to interpret and analyze 3D molecular structures through text generation. 3D-MoLM pairs a 3D molecular encoder with a 3D molecule-text projector so that 3D molecular representations are aligned with the LM's input space. The encoder, Uni-Mol, encodes 3D molecular structures. The projector, a Q-Former inspired by vision-language models, maps the encoder's 3D molecular representations into the LM's textual space.

The model is trained in three stages: 3D molecule-text representation learning, 3D molecule-text alignment via generative learning, and 3D molecule-centric instruction tuning. The instruction-tuning dataset, 3D-MoIT, is curated from PubChem and PubChemQC and improves the model's ability to follow human instructions and to understand 3D-dependent molecular properties. Experiments show that 3D-MoLM outperforms existing baselines on molecule-text retrieval, molecule captioning, and open-text molecular QA, particularly for 3D-dependent properties. The paper also discusses limitations and future directions, including the need for larger datasets and the exploration of other capabilities of large LMs.
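
To make the encoder-projector-LM pipeline described above more concrete, here is a minimal NumPy sketch of the projector step. It is not the paper's implementation. A real Q-Former is a multi-layer transformer, whereas this toy version uses a single cross-attention head with random, untrained weights. The class name `ToyProjector`, the helper `build_lm_input`, and all dimensions are hypothetical, and random arrays stand in for Uni-Mol's atom-level output and for the LM's text token embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class ToyProjector:
    """Hypothetical single-head stand-in for a Q-Former-style projector.

    A fixed set of query vectors cross-attends to per-atom embeddings and
    is then linearly mapped to the LM embedding width. Weights are random
    (untrained); only the data flow and tensor shapes are illustrated.
    """

    def __init__(self, enc_dim, lm_dim, num_queries=8, attn_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.attn_dim = attn_dim
        self.queries = rng.normal(0.0, 0.02, (num_queries, attn_dim))
        self.w_key = rng.normal(0.0, 0.02, (enc_dim, attn_dim))
        self.w_value = rng.normal(0.0, 0.02, (enc_dim, attn_dim))
        self.w_out = rng.normal(0.0, 0.02, (attn_dim, lm_dim))

    def __call__(self, atom_embeddings):
        # atom_embeddings: (num_atoms, enc_dim), e.g. a 3D encoder's output.
        keys = atom_embeddings @ self.w_key        # (num_atoms, attn_dim)
        values = atom_embeddings @ self.w_value    # (num_atoms, attn_dim)
        scores = self.queries @ keys.T / np.sqrt(self.attn_dim)
        weights = softmax(scores, axis=-1)         # (num_queries, num_atoms)
        pooled = weights @ values                  # (num_queries, attn_dim)
        return pooled @ self.w_out                 # (num_queries, lm_dim)


def build_lm_input(molecule_tokens, text_token_embeddings):
    """Prepend the projected molecule tokens to the text token embeddings."""
    return np.concatenate([molecule_tokens, text_token_embeddings], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    atom_embeddings = rng.normal(size=(12, 64))   # stand-in: 12 atoms
    text_embeddings = rng.normal(size=(5, 128))   # stand-in: 5 text tokens

    projector = ToyProjector(enc_dim=64, lm_dim=128)
    molecule_tokens = projector(atom_embeddings)
    lm_input = build_lm_input(molecule_tokens, text_embeddings)
    print(molecule_tokens.shape, lm_input.shape)  # (8, 128) (13, 128)
```

The point of the sketch is the shape contract: whatever the number of atoms, the projector emits a fixed number of vectors in the LM's embedding width, so they can be placed alongside ordinary text token embeddings in the LM input.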

3D-MoLM: TOWARDS 3D MOLECULE-TEXT INTERPRETATION IN LANGUAGE MODELS

17 Mar 2024 | Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, Qi Tian