4 Jul 2024 | Carl Edwards¹, Qingyun Wang¹, Lawrence Zhao² and Heng Ji¹
The L+M-24 dataset was created to address a central obstacle in training language-molecule models: the scarcity of molecule-language pair datasets. It focuses on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction. It is designed for the Language + Molecules Workshop at ACL 2024 and covers four categories: Biomedical, Light and Electricity, Human Interaction and Organoleptics, and Agriculture and Industry. The dataset was constructed from three databases: PubChem, Chemical Function (CheF), and ChemFOnt. Properties were extracted from these sources and converted to natural language using templates generated by GPT-4. The dataset includes 160,492 molecule-description pairs for training and 21,839 pairs for evaluation. The evaluation metrics include FTS, uniqueness, and property-specific precision, recall, and F-1 scores. The dataset proved challenging for existing models, particularly in capturing rare properties and molecule-protein interactions. Future work may involve incorporating other modalities, such as proteins, and improving evaluation metrics. The dataset will be used as a shared task at the First Language + Molecules Workshop at ACL 2024.
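To make the property-specific metrics concrete, the following is a minimal sketch of how per-property precision, recall, and F-1 could be computed. It assumes the properties have already been extracted from each gold and generated description as a set of labels; the summary does not say how that extraction is done, so the function name `property_prf` and the example property labels are hypothetical and not taken from the dataset or its evaluation code.

```python
from collections import defaultdict


def property_prf(gold, pred):
    """Per-property precision, recall, and F-1.

    gold, pred: equal-length lists of sets of property labels, one set per
    molecule (assumed to be extracted from the descriptions beforehand).
    Returns {property: (precision, recall, f1)}.
    """
    tp = defaultdict(int)  # property is in both gold and prediction
    fp = defaultdict(int)  # property is predicted but not in gold
    fn = defaultdict(int)  # property is in gold but not predicted

    for g, p in zip(gold, pred):
        for prop in g & p:
            tp[prop] += 1
        for prop in p - g:
            fp[prop] += 1
        for prop in g - p:
            fn[prop] += 1

    scores = {}
    for prop in set(tp) | set(fp) | set(fn):
        predicted = tp[prop] + fp[prop]
        actual = tp[prop] + fn[prop]
        precision = tp[prop] / predicted if predicted else 0.0
        recall = tp[prop] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[prop] = (precision, recall, f1)
    return scores


# Hypothetical labels, for illustration only.
gold = [{"anti-inflammatory", "sweet"}, {"sweet"}, {"herbicide"}]
pred = [{"anti-inflammatory"}, {"sweet", "herbicide"}, set()]
scores = property_prf(gold, pred)
```

Scoring each property separately, rather than pooling all predictions into one number, keeps rare properties visible: a model that never produces a rare property gets a recall of zero for it instead of having the miss diluted by frequent properties.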