July 17, 2024 | Gonzalo Benegas, Chengzhong Ye, Carlos Albors, Jianan Canal Li, Yun S. Song
Genomic Language Models (gLMs) are large language models trained on DNA sequences, offering significant potential to advance our understanding of genomes and DNA interactions. This review highlights key applications of gLMs, including fitness prediction, sequence design, and transfer learning. Despite recent progress, developing effective and efficient gLMs remains challenging, especially for species with large, complex genomes. The paper discusses major considerations for developing and evaluating gLMs, emphasizing the importance of data quality and quantity, model architecture, learning objectives, interpretation, and evaluation methods. It also addresses the challenges of generalization and the need for further research to model patterns across various scales. The authors conclude by outlining outstanding questions and future perspectives, emphasizing the importance of integrating deep genomics expertise to maximize the utility of gLMs.