2 Feb 2024 | Chang Liao, Yemin Yu, Yu Mei and Ying Wei
This paper provides a comprehensive survey of the application of Large Language Models (LLMs) in chemistry, covering the methodologies, challenges, and future directions of integrating LLMs into the field. The survey begins with the representation and tokenization of molecular data, categorizing chemical LLMs into three groups according to their input data. It then examines the pretraining objectives tailored to chemical LLMs, including masked language modeling, molecule property prediction, and autoregressive token generation, before turning to applications such as chatbots, in-context learning, and representation learning, and to downstream tasks like molecule property prediction, reaction prediction, and molecule generation.

The authors emphasize that the unique characteristics of chemical data, such as the complex semantics of chemical languages and the need for domain-specific knowledge, call for more sophisticated approaches than general-purpose LLMs provide. The survey closes with promising research directions, including deeper integration of chemical and quantum-chemistry knowledge, advances in continual learning, and improved model interpretability, underscoring the potential of LLMs to advance chemical research and development while acknowledging the challenges that remain.
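To make the tokenization discussion concrete, here is a minimal sketch of a regex-based SMILES tokenizer of the kind commonly used for chemical language models (the regex follows the pattern popularized by Schwaller et al.'s Molecular Transformer; the function name and the aspirin example are illustrative assumptions, not details from the survey):

```python
import re

# Regex-based SMILES tokenizer in the style popularized by Schwaller et al.;
# treats bracket atoms, two-letter elements (Cl, Br), ring-closure digits,
# and bond symbols as single tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

# Aspirin; two-letter elements and bracket atoms are what a naive
# per-character split would get wrong.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c',
#  'c', '1', 'C', '(', '=', 'O', ')', 'O']
```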
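Similarly, the masked language modeling objective mentioned above can be illustrated with a simplified, framework-free sketch over SMILES tokens; the 15% masking rate and the `[MASK]` token follow BERT-style conventions and are assumptions rather than details from the survey:

```python
import random

MASK_TOKEN, MASK_RATE = "[MASK]", 0.15  # BERT-style defaults (assumed)

def mask_for_mlm(tokens, seed=0):
    """Corrupt a token sequence for a masked-language-modeling objective.

    Returns the corrupted sequence plus a map from masked positions to
    the original tokens, which become the model's prediction targets.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_RATE:
            corrupted[i], targets[i] = MASK_TOKEN, tok
    return corrupted, targets

# Tokenized aspirin (from the sketch above).
tokens = ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c',
          'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']
corrupted, targets = mask_for_mlm(tokens)
print(corrupted)  # sequence fed to the model
print(targets)    # positions the model learns to recover
```

In practice, chemical MLM variants often mask at the level of whole atoms or substructures rather than arbitrary characters, so that the pretraining signal aligns with chemically meaningful units.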