Leveraging language representation for materials exploration and discovery

2024 | Jiaxing Qu, Yuxuan Richard Xie, Kamil M. Ciesielski, Claire E. Porter, Eric S. Toberer & Elif Ertekin
This study introduces a materials discovery framework that leverages natural language embeddings from language models to represent compositional and structural features of materials. The framework enables similarity analysis to recall relevant materials given a query, and multi-task learning to share information across related properties. Applied to thermoelectrics, it identifies under-studied material spaces and validates promising candidates through first-principles calculations and experiments. Language-based frameworks offer versatile, adaptable embeddings for effective materials exploration and discovery across diverse systems.

The goal of inorganic materials discovery is to navigate the materials space efficiently and identify candidates with targeted properties. Challenges include the growing complexity of that space and the varied mappings from material space to objective space. Ab initio methods provide accurate insights, while machine learning (ML) has emerged as a promising tool to expedite discovery workflows. A key challenge in ML for materials search, however, is defining universal model input representations. An ideal representation should convert inorganic crystals into a machine-readable format while capturing complexities such as defects, alloying, and disorder. Early ML models used hand-crafted descriptors, while recent approaches treat material atomic structures as graphs; both, however, fall short of providing universal, task-agnostic representations.

This work assesses the effectiveness of language representations in tackling general materials discovery tasks. Advances in natural language processing have enabled extraction of valuable information from the materials science literature, and contextual embeddings from transformer models encode domain knowledge into compact, information-rich vector representations.
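The similarity analysis described above amounts to nearest-neighbor retrieval in the embedding space. A minimal sketch, assuming materials are already represented as fixed-length embedding vectors (the toy vectors and sizes here are illustrative, not from the study):

```python
import numpy as np

def recall_candidates(query_vec, candidate_vecs, k=5):
    """Return indices of the k candidates most similar to the query,
    ranked by cosine similarity of L2-normalized embedding vectors.
    In the framework, the embeddings would come from a language model;
    here they are random placeholders."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per candidate
    return np.argsort(-sims)[:k]     # highest-similarity first

# toy example: 4-dimensional embeddings for 6 hypothetical materials
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))
query = embeddings[2] + 0.01 * rng.normal(size=4)  # near candidate 2
top = recall_candidates(query, embeddings, k=3)    # top[0] is 2
```

This recall step only generates a candidate pool; the ranking step would then score the recalled candidates with property-prediction models.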
A pioneering study demonstrated that word embeddings can capture underlying knowledge in materials science and be applied to tasks such as materials search and ranking. Building on this, a materials recommendation framework based on language representations is presented: given a query material with targeted properties, it identifies similar candidates. The framework uses a funnel architecture with two steps: candidate generation (recall) and property evaluation (ranking). Evaluated across downstream tasks, language representations prove highly effective at recalling relevant material candidates and predicting material properties.

The framework is applied to search for and recommend high-performance thermoelectric (TE) materials. It identifies structurally diversified TE candidates and under-explored material spaces, and validation through first-principles calculations and experiments confirms the potential of the recommended materials as high-performance TE materials.

The study demonstrates that language representations can effectively capture material composition, structure, and properties. Six embedding methods were evaluated, with Mat2Vec and fingerprints as baselines for compositional and structural embeddings; the results show that language representations outperform these traditional methods in capturing latent materials science knowledge.

Multi-task learning is used to improve property predictions by exploiting cross-task correlations. The multi-gate mixture-of-experts (MMoE) model leverages correlations between material property prediction tasks, demonstrating that pre-existing knowledge in the latent space can be transferred to new tasks for faster and more effective learning.
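The MMoE idea can be summarized in a small forward-pass sketch: shared expert networks are mixed per task by a softmax gate, so related property-prediction tasks share representation while keeping task-specific output towers. This is an illustrative, untrained numpy version with made-up layer sizes, not the architecture or weights used in the study:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MMoE:
    """Minimal multi-gate mixture-of-experts forward pass.
    Each task mixes the shared experts with its own softmax gate,
    then applies a task-specific linear tower."""
    def __init__(self, d_in, d_expert, n_experts, n_tasks, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [rng.normal(scale=0.1, size=(d_in, d_expert))
                        for _ in range(n_experts)]
        self.gates = [rng.normal(scale=0.1, size=(d_in, n_experts))
                      for _ in range(n_tasks)]
        self.towers = [rng.normal(scale=0.1, size=(d_expert, 1))
                       for _ in range(n_tasks)]

    def forward(self, x):
        # expert outputs stacked to shape (n_experts, batch, d_expert)
        e = np.stack([np.tanh(x @ w) for w in self.experts])
        outs = []
        for g, t in zip(self.gates, self.towers):
            w = softmax(x @ g)                  # (batch, n_experts) gate weights
            mixed = np.einsum('be,ebd->bd', w, e)  # task-specific expert mixture
            outs.append(mixed @ t)              # one property value per sample
        return outs

model = MMoE(d_in=8, d_expert=4, n_experts=3, n_tasks=2)
x = np.ones((5, 8))                 # batch of 5 embedding vectors
preds = model.forward(x)            # list of 2 arrays, each (5, 1)
```

Because the gates are input-dependent, each task can weight the experts differently per sample, which is how cross-task correlation is exploited without forcing all tasks through one shared bottleneck.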
The framework is evaluated on seven representative TE materials, showing that it can effectively rank materials based on similarity.