DNA language model GROVER learns sequence context in the human genome

August 2024 | Melissa Sanabria, Jonas Hirsch, Pierre M. Joubert, Anna R. Poetsch
The article introduces GROVER, a DNA language model trained on the human genome to learn sequence context and structure. GROVER uses byte-pair encoding (BPE) to create a vocabulary that captures the information content of the genome. It is trained on masked token prediction and fine-tuned for genome biology tasks, outperforming other models in tasks like promoter identification and protein-DNA binding. GROVER learns token characteristics, sequence context, and functional genomics annotations, enabling it to extract meaningful information from the genome. The model's embeddings reflect learned content beyond simple token identity, showing correlations with GC content, AG content, token length, and other genomic features.

GROVER's performance in fine-tuning tasks, such as promoter classification and CTCF binding prediction, is superior to that of other models, demonstrating its ability to learn biological information directly from sequence. The study highlights the potential of GROVER to advance genome biology by learning complex sequence contexts and functional annotations. The model's success underscores the importance of sequence context in genome function and the potential of deep learning to uncover hidden patterns in the genetic code.
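The byte-pair encoding step mentioned above can be illustrated with a minimal sketch. This is not GROVER's actual tokenizer or its vocabulary-selection procedure; it is only the core BPE idea, assuming we start from single nucleotides and repeatedly merge the most frequent adjacent pair:

```python
# Minimal BPE sketch over a DNA string (illustrative only; GROVER's
# real tokenizer and vocabulary size are not reproduced here).
from collections import Counter

def bpe_merges(sequence, num_merges):
    """Learn `num_merges` byte-pair merges over a single DNA string."""
    tokens = list(sequence)  # start from single nucleotides A, C, G, T
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append(a + b)
        # Re-tokenize: greedily replace each occurrence of the merged pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_merges("ACGTACGTACGT", 2)
# → merges == ["AC", "ACG"], tokens == ["ACG", "T", "ACG", "T", "ACG", "T"]
```

On real genomic data, frequent merges yield variable-length tokens whose boundaries reflect the sequence statistics of the genome, which is what lets the vocabulary capture its information content.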
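The masked-token-prediction objective can likewise be sketched in a few lines. The 15% masking rate and single `[MASK]` symbol below follow standard BERT-style practice and are assumptions, not GROVER's published hyperparameters (BERT-style training also sometimes substitutes random tokens, which is omitted here):

```python
# Hedged sketch of BERT-style token masking for pre-training
# (GROVER's exact masking scheme is not reproduced here).
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Return (masked_tokens, targets): masked positions and their labels."""
    rng = random.Random(seed)  # seeded for reproducibility of the demo
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # the model is trained to recover this token
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens(["ACG", "T", "ACG", "T", "ACG", "T"])
# With seed=1, position 0 is masked: targets == {0: "ACG"}
```

During pre-training, the model sees `masked` as input and is optimized to predict the tokens in `targets` from the surrounding sequence context alone, which is how it acquires the contextual representations that the fine-tuning tasks then exploit.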