July 17, 2024 | Gonzalo Benegas, Chengzhong Ye, Carlos Albors, Jianan Canal Li, Yun S. Song
Genomic Language Models (gLMs) are large language models trained on DNA sequences, with the potential to advance genomic understanding and function prediction. This review highlights key applications of gLMs, including fitness prediction, sequence design, and transfer learning, and discusses the challenges of developing effective and efficient models, especially for species with complex genomes. gLMs can predict the fitness of genetic variants from log-likelihood ratios, which helps identify deleterious mutations. They also enable the design of novel biological sequences, such as promoters and enhancers, and facilitate transfer learning across different genomic tasks. However, challenges remain in data quality, training data selection, and model architectures for handling long genomic sequences. gLMs show promise in genomics, but further research is needed to improve their performance and interpretability. The review also outlines outstanding questions for future research, including modeling long-range interactions, integrating structural variation, and understanding the scaling hypothesis for gLMs. Overall, gLMs are a valuable tool for genomics, but their development requires careful consideration of data, architecture, and evaluation methods.
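The log-likelihood ratio used for fitness prediction can be illustrated with a minimal sketch. It assumes a model that assigns a probability to each nucleotide at the variant position given the surrounding sequence; the function name `variant_llr` and the probabilities below are hypothetical and are not taken from the review or from any particular gLM.

```python
import math

NUCLEOTIDES = "ACGT"


def variant_llr(position_probs, ref, alt):
    """Return log p(alt) - log p(ref) at a single genomic position.

    position_probs: dict mapping each nucleotide to the probability a
    model assigns to it at this position (hypothetical model output).
    """
    for allele in (ref, alt):
        if allele not in NUCLEOTIDES:
            raise ValueError(f"unknown nucleotide: {allele!r}")
    p_ref = position_probs[ref]
    p_alt = position_probs[alt]
    if p_ref <= 0.0 or p_alt <= 0.0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_alt) - math.log(p_ref)


# Made-up probabilities standing in for a model's prediction at one site.
example_probs = {"A": 0.90, "C": 0.02, "G": 0.05, "T": 0.03}

# Reference allele A, alternate allele C.
score = variant_llr(example_probs, ref="A", alt="C")
print(f"LLR(A>C) = {score:.3f}")
```

Under this convention a negative score means the model finds the alternate allele less likely than the reference in that context, which is the signal read as evidence that a variant may be deleterious. How the per-position probabilities are obtained (for example by masking the site or by autoregressive scoring) depends on the model and is left open here.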