Scientific Large Language Models: A Survey on Biological & Chemical Domains

23 Jul 2024 | QIANG ZHANG, KEYAN DING, TIANWEN LYU, XINDA WANG, QINGYU YIN, YIWEN ZHANG, JING YU, YUHAO WANG, XIAOTONG LI, ZHUOYI XIANG, KEHUA FENG, XIANG ZHUANG, ZEYUAN WANG, MING QIN, MENGYAO ZHANG, JINLU ZHANG, JIYU CUI, TAO HUANG, PENGJU YAN, RENJUN XU, HONGYANG CHEN, XIAOLIN LI, XIAOHUI FAN, HUABIN XING, HUAJUN CHEN
The paper "Scientific Large Language Models: A Survey on Biological & Chemical Domains" by Qiang Zhang et al. provides a comprehensive review of large language models (LLMs) designed for scientific domains, focusing on biological and chemical languages. The authors introduce the concept of Scientific Large Language Models (Sci-LLMs): specialized LLMs trained to understand, interpret, and generate scientific languages. These models are categorized into textual, molecular, protein, genomic, and multimodal types, each tailored to specific scientific tasks.

The paper begins with background on LLMs and the formulation of scientific languages, including molecular, protein, and genomic languages. It then examines the architecture of Sci-LLMs, categorizing them into encoder-only, decoder-only, and encoder-decoder models, and discusses pre-training and fine-tuning, emphasizing the importance of diverse corpora and domain-specific datasets.

The main sections of the paper cover five types of Sci-LLMs:

1. **Textual Scientific Large Language Models (Text-Sci-LLMs)**: Trained on textual scientific data in the medical, biological, chemical, and comprehensive domains; they excel at understanding and generating scientific text.
2. **Molecular Large Language Models (Mol-LLMs)**: Specialized for molecular data, these models are valuable in drug discovery, materials science, and understanding chemical interactions.
3. **Protein Large Language Models (Prot-LLMs)**: Trained on protein-related data, these models can predict protein structures, functions, and interactions.
4. **Genomic Large Language Models (Gene-LLMs)**: Focused on genomic data, these models help analyze DNA sequences, understand genetic variation, and support genetic research.
5. **Multimodal Scientific Large Language Models (MM-Sci-LLMs)**: These advanced models handle multiple types of scientific data, making them suitable for interdisciplinary research.

The paper also surveys the datasets and evaluation benchmarks used for training and assessing these models, highlighting the importance of domain-specific resources and metrics. Finally, it discusses the limitations of existing models and suggests future research directions, emphasizing the need for more comprehensive and specialized models to address the unique challenges of scientific languages.
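The survey's central framing is that molecules, proteins, and genomes can be treated as languages and fed to language models. As a minimal sketch of that idea (not the paper's method), the snippet below tokenizes a SMILES string and a protein sequence at the character level and maps tokens to integer ids; real Sci-LLMs use learned subword vocabularies (e.g. BPE), so the tokenizer, vocabulary, and example strings here are simplified illustrations.

```python
# Illustrative sketch: treating scientific notations as "languages".
# Real Sci-LLMs learn subword vocabularies (e.g. BPE); this
# character-level tokenizer is a simplification for intuition only.

def char_tokenize(sequence: str) -> list[str]:
    """Split a scientific-language string into single-character tokens."""
    return list(sequence)

def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each unique token to an integer id, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# A molecular "sentence": aspirin in SMILES notation.
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
# A protein "sentence": a short amino-acid sequence (one-letter codes).
protein = "MKTAYIAKQR"

smiles_tokens = char_tokenize(smiles)
protein_tokens = char_tokenize(protein)

vocab = build_vocab(smiles_tokens)
smiles_ids = [vocab[t] for t in smiles_tokens]
print(smiles_tokens[:6])   # first few tokens of the SMILES string
print(smiles_ids[:6])      # their integer ids under this toy vocab
```

Once sequences are integer ids, the same transformer machinery used for natural language (masked or autoregressive pre-training) applies, which is what unifies the model families the survey describes.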