Structured information extraction from scientific text with large language models


15 February 2024 | John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, Anubhav Jain
This paper presents a method for extracting structured knowledge from scientific text using large language models (LLMs) such as GPT-3 and LLaMA-2. The approach combines named entity recognition (NER) and relation extraction (RE) to extract complex scientific information from materials chemistry texts. The method is tested on three tasks: linking dopants and host materials, cataloging metal-organic frameworks (MOFs), and extracting general composition/phase/morphology/application information. The output can be formatted as simple English sentences or structured JSON objects.

The paper highlights the challenges traditional NER and RE methods face in handling intricate, hierarchical relationships between entities, and demonstrates that LLMs can effectively address these challenges. The method is flexible, accessible, and applicable to a variety of scientific domains. The results show that GPT-3 and LLaMA-2 achieve high precision, recall, and F1 scores on the tested tasks, with GPT-3 performing slightly better. The paper also discusses the benefits of a human-in-the-loop annotation process for improving efficiency, as well as the limitations of the approach, such as the need for a large number of training samples and the potential for "hallucination" in LLM outputs. Overall, the proposed method provides a powerful and accessible solution for extracting structured scientific knowledge from unstructured text.
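To make the described pipeline concrete, here is a minimal sketch of the kind of prompt-and-parse loop the paper's joint NER/RE approach implies for the dopant–host task. The prompt wording, the JSON schema (`host`/`dopants` keys), and the function names are assumptions for illustration, not the authors' verbatim design; in the real workflow the completion would come from a fine-tuned GPT-3 or LLaMA-2 model rather than a hard-coded string.

```python
import json

# Hypothetical prompt template for the dopant/host extraction task.
# The exact wording and schema are assumptions, not taken from the paper.
PROMPT_TEMPLATE = (
    "Extract doping relationships from the passage below. "
    "Return a JSON list of objects with keys 'host' and 'dopants'.\n\n"
    "Passage: {passage}\n\nJSON:"
)


def build_prompt(passage: str) -> str:
    """Fill the extraction prompt with a source passage."""
    return PROMPT_TEMPLATE.format(passage=passage)


def parse_completion(completion: str) -> list:
    """Parse and lightly validate the model's JSON completion.

    Raising on malformed records is one simple guard against the
    'hallucination' failure mode the paper discusses.
    """
    records = json.loads(completion)
    for rec in records:
        if "host" not in rec or "dopants" not in rec:
            raise ValueError(f"malformed record: {rec}")
    return records


# Example: a completion a model might plausibly return for the passage
# "Mn-doped ZnO thin films were grown by pulsed laser deposition."
completion = '[{"host": "ZnO", "dopants": ["Mn"]}]'
print(parse_completion(completion))  # → [{'host': 'ZnO', 'dopants': ['Mn']}]
```

The same pattern generalizes to the MOF and composition/phase/morphology/application tasks by swapping the schema keys in the template and validator.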