July 25, 2024 | Mara Schilling-Wilhelm, Martiño Ríos-García, Sherjeel Shabih, María Victoria Gil, Santiago Miret, Christoph T. Koch, José A. Márquez, and Kevin Maik Jablonka
This review explores the application of large language models (LLMs) in structured data extraction from materials science literature. The field traditionally relied on manual curation and partial automation for data extraction, but LLMs offer a promising alternative for efficiently extracting structured, actionable data from unstructured text. While LLMs can solve tasks without explicit training, their application in materials science requires domain knowledge to guide and validate outputs.
The review provides a comprehensive overview of LLM-based data extraction workflows, including preprocessing, LLM interaction, and postprocessing. It addresses the lack of standardized guidelines and presents frameworks for leveraging the synergy between LLMs and materials science expertise. The review also discusses challenges such as data preprocessing, dealing with finite context, and the use of advanced prompting techniques. It highlights the importance of structured data in materials design and the potential of LLMs to accelerate the development of novel materials for societal needs. The review outlines the working principles of LLMs, including their training, tuning, and interaction with materials science data. It also discusses the use of vision language models (VLMs) for handling complex structures like tables and plots. The review concludes with a discussion of the future directions and challenges in applying LLMs to materials science data extraction.