2 Feb 2024 | Chang Liao, Yemin Yu, Yu Mei and Ying Wei
This paper provides a comprehensive survey of the application of Large Language Models (LLMs) in chemistry, covering the methodologies, challenges, and future directions of integrating LLMs into the field. The survey begins with the representation and tokenization of molecular data, categorizing chemical LLMs into three groups according to their input data. It then examines the pretraining objectives tailored to chemical LLMs, including masked language modeling, molecule property prediction, and autoregressive token generation, before turning to applications such as chatbots, in-context learning, and representation learning, and to downstream tasks like molecule property prediction, reaction prediction, and molecule generation.

The authors emphasize that the unique characteristics of chemical data, such as the complex semantics of chemical languages and the need for domain-specific knowledge, call for more sophisticated approaches than general-purpose LLMs provide. The survey closes with promising research directions, including deeper integration of chemical and quantum-chemistry knowledge, advances in continual learning, and improved model interpretability, underscoring the potential of LLMs to advance chemical research and development while acknowledging the challenges that remain.
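To make the tokenization discussion concrete, here is a minimal sketch of a regex-based SMILES tokenizer of the kind commonly used for chemical language models (the regex follows the pattern popularized by Schwaller et al.'s Molecular Transformer; the function name and the aspirin example are illustrative assumptions, not details from the survey):

```python
import re

# Regex-based SMILES tokenizer in the style popularized by Schwaller et al.;
# treats bracket atoms, two-letter elements (Cl, Br), ring-closure digits,
# and bond symbols as single tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

# Aspirin; two-letter elements and bracket atoms are what a naive
# per-character split would get wrong.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c',
#  'c', '1', 'C', '(', '=', 'O', ')', 'O']
```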
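Similarly, the masked language modeling objective mentioned above can be illustrated with a simplified, framework-free sketch over SMILES tokens; the 15% masking rate and the `[MASK]` token follow BERT-style conventions and are assumptions rather than details from the survey:

```python
import random

MASK_TOKEN, MASK_RATE = "[MASK]", 0.15  # BERT-style defaults (assumed)

def mask_for_mlm(tokens, seed=0):
    """Corrupt a token sequence for a masked-language-modeling objective.

    Returns the corrupted sequence plus a map from masked positions to
    the original tokens, which become the model's prediction targets.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_RATE:
            corrupted[i], targets[i] = MASK_TOKEN, tok
    return corrupted, targets

# Tokenized aspirin (from the sketch above).
tokens = ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c',
          'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']
corrupted, targets = mask_for_mlm(tokens)
print(corrupted)  # sequence fed to the model
print(targets)    # positions the model learns to recover
```

In practice, chemical MLM variants often mask at the level of whole atoms or substructures rather than arbitrary characters, so that the pretraining signal aligns with chemically meaningful units.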