4 Jul 2024 | Carl Edwards, Qingyun Wang, Lawrence Zhao, Heng Ji
The paper introduces the $L+M-24$ dataset, designed for the Language + Molecules Workshop at ACL 2024. This dataset aims to address the scarcity of molecule-language pair datasets, which are crucial for training language-molecule models. The $L+M-24$ dataset focuses on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction. It is divided into four categories: Biomedical, Light and Electricity, Human Interaction and Organoleptics, and Agriculture and Industry. The dataset is constructed using properties from PubChem, Chemical Function (CheF), and ChemFOnt, and is converted into natural language using GPT-4-generated templates. The evaluation metrics include F1 scores for property identification and molecule generation, with special attention given to rare properties and molecule-protein interactions.
The paper also discusses the challenges faced by models in handling unseen property combinations and suggests future directions, including integrating other modalities and improving evaluation metrics.
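
As a rough illustration of what an F1 score for property identification can look like, the sketch below compares predicted and reference property sets per molecule and micro-averages the counts. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the function name `property_f1`, the representation of properties as sets of strings, the micro-averaging choice, and the example property labels are all hypothetical.

```python
from typing import Iterable, Set, Tuple


def property_f1(
    predicted: Iterable[Set[str]],
    reference: Iterable[Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-molecule property sets.

    Assumption: each molecule is described by a set of property labels, and
    a predicted label counts as correct only if it is in the reference set.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predicted, reference):
        tp += len(pred & gold)   # properties correctly identified
        fp += len(pred - gold)   # properties predicted but not in the reference
        fn += len(gold - pred)   # reference properties that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: two molecules with illustrative property labels.
predicted = [{"anti-inflammatory", "sweet"}, {"herbicide"}]
reference = [{"anti-inflammatory"}, {"herbicide", "photovoltaic"}]
precision, recall, f1 = property_f1(predicted, reference)
```

In this example there are two correct properties, one spurious prediction, and one missed property, so precision, recall, and F1 all come out to 2/3. Micro-averaging lets frequent properties dominate the score; a per-property (macro) average would give more weight to the rare properties the paper singles out.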