2024 | Javier Mendoza-Revilla, Evan Trop, Liam Gonzalez, Maša Roller, Hugo Dalla-Torre, Bernardo P. de Almeida, Guillaume Richard, Jonathan Caton, Nicolas Lopez Carranza, Marcin Skward, Alex Laterre, Karim Beguir, Thomas Pierrot, Marie Lopez
This paper introduces AgroNT, a large language model (LLM) trained on the genomes of 48 plant species, primarily edible and crop species. AgroNT is designed to predict regulatory annotations, promoter/terminator strength, and tissue-specific gene expression, and to prioritize functional variants. The model demonstrates state-of-the-art performance on these tasks and is particularly effective in zero-shot settings, where it can predict the impact of genetic variants in understudied or orphan crops. The authors also perform a large-scale in silico mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations, providing predicted effects for future research. Additionally, they propose the Plants Genomic Benchmark (PGB) as a comprehensive benchmark for deep learning methods in plant genomic research.
The pre-trained AgroNT model is publicly available for future research purposes. The study highlights the significant capabilities of AgroNT in plant genomics and its potential for improving genomic editing and breeding engineering.
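
To make the idea of in silico mutagenesis concrete, the sketch below enumerates every single-nucleotide substitution of a short sequence and scores each mutant against the reference. It is only an illustration of the general technique, not the authors' pipeline: the function names are hypothetical, and `gc_fraction` is a toy placeholder standing in for a model-derived score such as one AgroNT might produce.

```python
from typing import Callable, Iterator, List, Tuple

BASES = "ACGT"


def single_nucleotide_variants(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref, alt, mutated_sequence) for each possible substitution."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        if ref not in BASES:
            continue  # skip ambiguous bases such as N
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


def gc_fraction(seq: str) -> float:
    """Toy placeholder score (NOT AgroNT): fraction of G/C bases in the sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def mutation_effects(
    seq: str, score: Callable[[str], float] = gc_fraction
) -> List[Tuple[int, str, str, float]]:
    """Return (position, ref, alt, score(mutant) - score(reference)) for every variant."""
    reference_score = score(seq.upper())
    return [
        (pos, ref, alt, score(mutant) - reference_score)
        for pos, ref, alt, mutant in single_nucleotide_variants(seq)
    ]
```

A sequence of length n with no ambiguous bases yields 3n variants, which is how a genome-scale analysis reaches millions of scored mutations. In practice the `score` argument would be replaced by a function that wraps a trained model.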