23 July 2024 | Melissa Sanabria, Jonas Hirsch, Pierre M. Joubert, Anna R. Poetsch
The paper introduces GROVER, a deep-learning model designed to understand the genetic code and sequence context in the human genome. GROVER's vocabulary is built by applying byte-pair encoding to the human genome, and a custom task, next-k-mer prediction, is used to select the optimal vocabulary. The model learns to encode information related to token frequency, sequence content, and token length, and it also captures context and lexical ambiguity. GROVER outperforms other models in fine-tuning tasks addressing genome biology, such as promoter identification, promoter scanning, and protein-DNA binding prediction. The study highlights the potential of GROVER to extract complex information from the genome, including functional genomics annotations and structural features, and suggests that it can be used to develop a grammar book for the code of life.
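
To make the byte-pair encoding step concrete, the following is a minimal sketch of how merges could be learned on a DNA string, starting from single nucleotides and repeatedly fusing the most frequent adjacent token pair. It is an illustration only and not GROVER's tokenizer: the function name `bpe_merges`, the toy sequence, the tie-breaking rule, and the number of merges are all assumptions made for this example.

```python
from collections import Counter


def bpe_merges(sequence, num_merges):
    """Learn up to `num_merges` BPE merges on one DNA string (toy sketch).

    Hypothetical helper for illustration; not taken from the GROVER code base.
    Returns the list of merged pairs and the final tokenization.
    """
    tokens = list(sequence)  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs (overlapping occurrences are counted).
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; break ties lexically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite the token list, fusing each occurrence of the best pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = bpe_merges("ATATATGC", 2)
print(merges)  # [('A', 'T'), ('AT', 'AT')]
print(tokens)  # ['ATAT', 'AT', 'G', 'C']
```

Each merge adds one token to the vocabulary, so the number of merges controls vocabulary size. According to the summary above, GROVER chooses among candidate vocabularies using next-k-mer prediction; that selection step is not shown in this sketch.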