Leveraging language representation for materials exploration and discovery

2024 | Jiaxing Qu, Yuxuan Richard Xie, Kamil M. Ciesielski, Claire E. Porter, Eric S. Toberer & Elif Ertekin
This study introduces a materials discovery framework that leverages natural language embeddings from language models to represent compositional and structural features of materials. The framework enables similarity analysis to recall relevant materials given a query, and multi-task learning to share information across related properties. Applied to thermoelectrics, it identifies under-studied material spaces and validates promising candidates through first-principles calculations and experiments. Language-based frameworks offer versatile, adaptable embeddings for effective materials exploration and discovery across diverse systems.

The goal of inorganic materials discovery is to navigate the materials space efficiently and identify candidates with targeted properties. Challenges include the growing complexity of that space and the varied mappings from material space to objective space. Ab initio methods provide accurate insights, while machine learning (ML) has emerged as a promising tool to expedite discovery workflows. A key challenge in ML for materials search, however, is defining universal model input representations. An ideal representation should convert inorganic crystals into a machine-readable format while capturing complexities such as defects, alloying, and disorder. Early ML models used hand-crafted descriptors, while recent approaches treat material atomic structures as graphs; both, however, fall short of providing universal, task-agnostic representations.

This work assesses the effectiveness of language representations in tackling general materials discovery tasks. Advances in natural language processing have enabled extraction of valuable information from the materials science literature, and contextual embeddings from transformer models encode domain knowledge into compact, information-rich vector representations.
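The similarity analysis described above amounts to nearest-neighbor retrieval in the embedding space. A minimal sketch, assuming materials are already represented as fixed-length embedding vectors (the toy vectors and sizes here are illustrative, not from the study):

```python
import numpy as np

def recall_candidates(query_vec, candidate_vecs, k=5):
    """Return indices of the k candidates most similar to the query,
    ranked by cosine similarity of L2-normalized embedding vectors.
    In the framework, the embeddings would come from a language model;
    here they are random placeholders."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per candidate
    return np.argsort(-sims)[:k]     # highest-similarity first

# toy example: 4-dimensional embeddings for 6 hypothetical materials
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))
query = embeddings[2] + 0.01 * rng.normal(size=4)  # near candidate 2
top = recall_candidates(query, embeddings, k=3)    # top[0] is 2
```

This recall step only generates a candidate pool; the ranking step would then score the recalled candidates with property-prediction models.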
A pioneering study demonstrated that word embeddings can capture underlying knowledge in materials science and be applied to tasks such as materials search and ranking. Building on this, a materials recommendation framework based on language representations is presented: given a query material with targeted properties, it identifies similar candidates. The framework uses a funnel architecture with two steps: candidate generation (recall) and property evaluation (ranking). Evaluated across downstream tasks, language representations prove highly effective at recalling relevant material candidates and predicting material properties.

The framework is applied to search for and recommend high-performance thermoelectric (TE) materials. It identifies structurally diversified TE candidates and under-explored material spaces, and validation through first-principles calculations and experiments confirms the potential of the recommended materials as high-performance TE materials.

The study demonstrates that language representations can effectively capture material composition, structure, and properties. Six embedding methods were evaluated, with Mat2Vec and fingerprints as baselines for compositional and structural embeddings; the results show that language representations outperform these traditional methods in capturing latent materials science knowledge.

Multi-task learning is used to improve property predictions by exploiting cross-task correlations. The multi-gate mixture-of-experts (MMoE) model leverages correlations between material property prediction tasks, demonstrating that pre-existing knowledge in the latent space can be transferred to new tasks for faster and more effective learning.
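The MMoE idea can be summarized in a small forward-pass sketch: shared expert networks are mixed per task by a softmax gate, so related property-prediction tasks share representation while keeping task-specific output towers. This is an illustrative, untrained numpy version with made-up layer sizes, not the architecture or weights used in the study:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MMoE:
    """Minimal multi-gate mixture-of-experts forward pass.
    Each task mixes the shared experts with its own softmax gate,
    then applies a task-specific linear tower."""
    def __init__(self, d_in, d_expert, n_experts, n_tasks, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [rng.normal(scale=0.1, size=(d_in, d_expert))
                        for _ in range(n_experts)]
        self.gates = [rng.normal(scale=0.1, size=(d_in, n_experts))
                      for _ in range(n_tasks)]
        self.towers = [rng.normal(scale=0.1, size=(d_expert, 1))
                       for _ in range(n_tasks)]

    def forward(self, x):
        # expert outputs stacked to shape (n_experts, batch, d_expert)
        e = np.stack([np.tanh(x @ w) for w in self.experts])
        outs = []
        for g, t in zip(self.gates, self.towers):
            w = softmax(x @ g)                  # (batch, n_experts) gate weights
            mixed = np.einsum('be,ebd->bd', w, e)  # task-specific expert mixture
            outs.append(mixed @ t)              # one property value per sample
        return outs

model = MMoE(d_in=8, d_expert=4, n_experts=3, n_tasks=2)
x = np.ones((5, 8))                 # batch of 5 embedding vectors
preds = model.forward(x)            # list of 2 arrays, each (5, 1)
```

Because the gates are input-dependent, each task can weight the experts differently per sample, which is how cross-task correlation is exploited without forcing all tasks through one shared bottleneck.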
The framework is evaluated on seven representative TE materials, showing that it can effectively rank materials based on similarity.