February 27, 2024 | Eric Nguyen*, Michael Poli*, Matthew G. Durrant*, Armin W. Thomas1, Brian Kang1, Jeremy Sullivan2, Madelena Y. Ng1, Ashley Lewis1, Aman Patel1, Aaron Lou1, Stefano Ermon1,4, Stephen A. Baccus1, Tina Hernandez-Boussard1, Christopher Ré1, Patrick D. Hsu†,2,5, and Brian L. Hie†,1,2
Evo is a genomic foundation model that enables prediction and generation tasks from the molecular to genome scale. Trained on prokaryotic whole-genome data, Evo uses a context length of 131 kilobases (kb) and is based on the StripedHyena architecture, which combines attention and data-controlled convolutional operators for efficient processing of long sequences.

For prediction, Evo performs zero-shot function prediction across DNA, RNA, and protein modalities, outperforming specialized models in predicting mutational effects on noncoding RNAs and gene expression from regulatory DNA. It also predicts gene essentiality at nucleotide resolution across diverse bacterial and phage genomes, and its performance improves significantly when additional genomic context is provided.

For generation, Evo handles multilevel tasks: for the first time, it generates synthetic CRISPR-Cas molecular complexes (Cas proteins together with their noncoding guide RNAs) and entire multi-gene transposable systems. It can also generate coding-rich sequences up to 650 kb in length, far exceeding previous methods, with a dense coding architecture that recapitulates key characteristics of natural genomes.

Evo is publicly available as open-source code and models. Its limitations include its focus on prokaryotic data and the need for further research to address technical challenges and ethical considerations. Even so, Evo represents a promising foundation for improving our understanding and control of biology across multiple levels of complexity.
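
The summary does not say how the zero-shot mutational-effect scores are computed. One common approach with autoregressive sequence models is to compare the log-likelihood of a mutant sequence with that of the wild type, treating a lower likelihood as a sign of a more disruptive mutation. The sketch below illustrates that idea only: a toy k-mer Markov model stands in for a trained genomic model, and every name in it (`train_markov`, `log_likelihood`, `mutation_score`) is hypothetical and not part of Evo's released code.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k=2, pseudo=1.0):
    """Count (k-mer context -> next nucleotide) transitions.

    Toy stand-in for a trained sequence model; NOT how Evo is trained.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1.0
    return {"k": k, "pseudo": pseudo, "counts": counts}


def log_likelihood(model, seq):
    """Sum of log P(nucleotide | previous k nucleotides) with add-one style smoothing."""
    k, pseudo, counts = model["k"], model["pseudo"], model["counts"]
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})  # unseen context -> uniform after smoothing
        denom = sum(ctx.values()) + pseudo * len(ALPHABET)
        total += math.log((ctx.get(seq[i], 0.0) + pseudo) / denom)
    return total


def mutation_score(model, wild_type, pos, alt):
    """Log-likelihood of the point mutant minus that of the wild type.

    Negative values mean the model finds the mutant less plausible.
    """
    mutant = wild_type[:pos] + alt + wild_type[pos + 1:]
    return log_likelihood(model, mutant) - log_likelihood(model, wild_type)


if __name__ == "__main__":
    # Hypothetical training data: a perfectly periodic toy "genome".
    model = train_markov(["ACGT" * 6], k=2)
    wild_type = "ACGTACGT"
    for alt in ALPHABET:
        print(alt, round(mutation_score(model, wild_type, 4, alt), 3))
```

With a real model, `log_likelihood` would be replaced by the model's own per-nucleotide log-probabilities; the scoring step (mutant minus wild type) stays the same under this assumption.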