A foundational large language model for edible plant genomes

2024 | Javier Mendoza-Revilla, Evan Trop, Liam Gonzalez, Maša Roller, Hugo Dalla-Torre, Bernardo P. de Almeida, Guillaume Richard, Jonathan Caton, Nicolas Lopez Carranza, Marcin Skwark, Alex Laterre, Karim Beguir, Thomas Pierrot & Marie Lopez
AgroNT is a large language model trained on genomes from 48 plant species, with a focus on edible plants. It can predict regulatory annotations, promoter and terminator strength, and tissue-specific gene expression, and it can prioritize functional variants. The authors also ran in silico saturation mutagenesis on cassava, assessing the predicted impact of over 10 million mutations. The datasets compiled for the evaluations are proposed as the Plants Genomic Benchmark (PGB), a comprehensive benchmark for deep learning-based methods in plant genomics. The pre-trained AgroNT model is available on HuggingFace for future research.
The advent of high-throughput next-generation sequencing has led to a vast increase in genomic data in plant sciences. Since the completion of the genome sequence of Arabidopsis thaliana over 20 years ago, genome sequences of more than 200 plant species have been published. However, generating a species' assembly is only the first step in understanding its genome. Additional experiments and computational processing are necessary for structural and functional annotation of important genomic regions. Many plant species, including 'orphan crops' important for regional food and economic security, lack sufficient transcriptomic, regulatory, or proteomic experiments. This gap limits understanding of growth, senescence, yield, and responses to stresses, and in turn limits the use of modern improvement tools such as high-throughput phenotyping, genomic selection, and genome editing. Novel approaches that can accurately predict gene annotations and regulatory genomic features directly from DNA sequences have the potential to provide valuable biological insights and to assist in genome editing applications.
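To make the idea of in silico saturation mutagenesis concrete, here is a minimal sketch in Python: every position of a sequence is replaced with each of the three alternative bases, and each mutant is scored against the reference. The function names and the `predict` callable are hypothetical, and the GC-content scorer is only a toy stand-in for a trained model such as AgroNT. The procedure actually used for cassava is not described in this summary.

```python
from typing import Callable, Dict, Iterator, Tuple

NUCLEOTIDES = "ACGT"


def saturation_mutants(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    possible single-base substitution in `seq`."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        if ref not in NUCLEOTIDES:
            continue  # skip ambiguous bases such as N
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


def score_mutations(
    seq: str, predict: Callable[[str], float]
) -> Dict[Tuple[int, str, str], float]:
    """Return the change in predicted score (mutant minus reference) for
    each substitution. `predict` is a hypothetical sequence-to-score model."""
    ref_score = predict(seq.upper())
    return {
        (pos, ref, alt): predict(mutant) - ref_score
        for pos, ref, alt, mutant in saturation_mutants(seq)
    }


def gc_content(seq: str) -> float:
    """Toy stand-in for a trained model: fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    effects = score_mutations("ACGT", gc_content)
    for (pos, ref, alt), delta in sorted(effects.items()):
        print(f"{ref}{pos + 1}{alt}\t{delta:+.2f}")
```

A sequence of length n yields 3n mutants, so the number of model evaluations grows linearly with the length of the region scanned. In practice the mutants would be scored in batches rather than one at a time.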
The complexity of the sequence determinants of gene structure and regulatory features makes end-to-end deep learning approaches well suited to learning directly from DNA sequences and accurately predicting specific outcomes. Most deep learning approaches have relied on supervised learning, which depends on abundant labeled data and therefore gives sub-optimal performance and limited usability in data-scarce scenarios.
Self-supervised learning, where a model is first trained on a large unlabeled corpus and then fine-tuned on supervised tasks, has shown success in natural language processing. Language models (LMs) such as BERT and GPT have gained traction in biology. These models can be trained on unlabeled data and generate versatile representations capable of solving specific tasks. LMs overcome a limitation of supervised learning by not relying on single reference genomes, which often provide an incomplete and biased picture of genomic diversity. LMs can leverage multiple reference genomes, including those from genetically distant species, thereby increasing overall diversity, which has been shown to significantly enhance prediction performance. This diversity is particularly relevant in plant species due to the structural complexities of their genomes. LMs are also well suited to zero-shot learning, a transfer learning approach that enables the model to recognize and classify samples from new classes not encountered during training. Zero-shot predictions represent an alternative to the traditional method of training supervised models on large amounts of functional genomic data. Since LMs are