Structured information extraction from scientific text with large language models


15 February 2024 | John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson & Anubhav Jain
This article presents a method for extracting structured information from scientific text using large language models (LLMs). The approach involves fine-tuning LLMs such as GPT-3 and Llama-2 to simultaneously perform named entity recognition (NER) and relation extraction (RE), enabling the extraction of complex scientific knowledge from research papers. The method is tested on three tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks (MOFs), and general composition/phase/morphology/application information extraction. The extracted information can be returned as simple English sentences or as structured formats such as JSON objects.

Scientific knowledge about solid-state materials is scattered across the text, tables, and figures of millions of research papers, making it difficult for researchers to leverage existing knowledge effectively. Databases of materials property data derived from ab initio simulations are common but limited to computationally accessible properties, whereas databases of experimental property measurements remain comparatively small. Recent advances in natural language processing (NLP) have enabled the structuring of existing materials science knowledge. Most of this work has focused on NER, in which entity labels such as "material" or "property" are applied to words in the text; the tagged sequences can then be used to build auto-generated tabular databases of materials property data. Relation extraction, which requires identifying relationships between named entities, remains a challenge.

The proposed method uses a sequence-to-sequence (seq2seq) approach in which a model is trained to output tuples of named entities and relation labels. It is evaluated on three tasks: solid-state impurity doping, MOF information extraction, and general materials information extraction. GPT-3 achieves the highest F1 scores on the general and MOF tasks, while Llama-2 performs well on the doping task. The workflow is flexible and accessible: researchers define the desired output structure, annotate text passages using that structure, and fine-tune an LLM on these examples so that it outputs extracted information in the same structured format. Performance metrics indicate high accuracy in extracting meaningful relationships between entities.

Annotation time is reduced through a "human-in-the-loop" process in which partially trained models pre-annotate text that human annotators then correct; this significantly shortens the time required to complete annotations, especially as the number of training samples grows. The method handles complex relationships between entities, including cases where information appears as lists of multiple items, and it reduces the need for post-processing because error correction and normalization can be embedded directly into the training examples. The approach is not limited to materials science and can be applied to other domains such as chemistry, the health sciences, or biology. Overall, the method offers a flexible, general route to building structured databases of scientific knowledge directly from the research literature.
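The seq2seq scheme pairs a text passage (the prompt) with a structured completion during fine-tuning. The following is a minimal sketch of one such training record, assuming a JSON Lines fine-tuning format and an illustrative schema for the dopant/host task; the field names, prompt wording, and example passage are assumptions for illustration, not the paper's exact schema.

import json

# Hypothetical fine-tuning record for the dopant/host task; schema and prompt
# wording are illustrative assumptions, not the authors' verbatim format.
passage = (
    "We report enhanced conductivity in Nb-doped TiO2 thin films "
    "prepared by pulsed laser deposition."
)

record = {
    # Prompt: the source passage plus a brief extraction instruction.
    "prompt": passage + "\n\nExtract dopant-host relations as JSON:\n",
    # Completion: entities and their relations serialized as a JSON object,
    # which is what the seq2seq model is trained to emit.
    "completion": json.dumps(
        {
            "hosts": ["TiO2"],
            "dopants": ["Nb"],
            "relations": [{"dopant": "Nb", "host": "TiO2"}],
        }
    ),
}

# Fine-tuning datasets are typically stored as JSON Lines, one record per line.
print(json.dumps(record))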
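The human-in-the-loop annotation step can likewise be sketched as a loop in which a partially trained model drafts an extraction and a human corrects it before the example joins the training set. The helper names below are hypothetical and the review step is stubbed; this is a sketch of the workflow described above, not the authors' tooling.

import json

def pre_annotate(model, passage: str) -> dict:
    # Hypothetical helper: ask the partially trained model for a draft extraction.
    # "model" is any callable that returns a JSON string completion for the passage.
    draft = model(passage)
    return json.loads(draft)

def human_review(passage: str, draft: dict) -> dict:
    # Hypothetical helper: a human annotator corrects the draft.
    # Stubbed as an identity function; a real workflow would present the draft
    # in an annotation interface and return the corrected structure.
    return draft

def annotate_corpus(model, passages: list[str]) -> list[dict]:
    # Human-in-the-loop loop: pre-annotate, correct, and collect training records.
    records = []
    for passage in passages:
        draft = pre_annotate(model, passage)
        corrected = human_review(passage, draft)
        records.append({"prompt": passage, "completion": json.dumps(corrected)})
    return records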