Sequence modeling and design from molecular to genome scale with Evo

February 27, 2024 | Eric Nguyen*,1,2, Michael Poli*,3, Matthew G. Durrant*,2, Armin W. Thomas1, Brian Kang1, Jeremy Sullivan2, Madelena Y. Ng1, Ashley Lewis1, Aman Patel1, Aaron Lou1, Stefano Ermon1,4, Stephen A. Baccus1, Tina Hernandez-Boussard1, Christopher Ré1, Patrick D. Hsu†2,5, and Brian L. Hie†,1,2
Evo is a genomic foundation model trained on hundreds of billions of DNA tokens across prokaryotic genomes, capable of predicting and generating DNA sequences at the molecular, systems, and genome scales. The model, based on the StripedHyena architecture, achieves single-nucleotide resolution with a context length of 131,000 tokens. Evo demonstrates superior performance in zero-shot function prediction for proteins, non-coding RNAs, and regulatory DNA, outperforming specialized models. It also excels in generative tasks, such as designing CRISPR-Cas molecular complexes and transposable elements, and predicting gene essentiality at the nucleotide level. Evo can generate coding-rich sequences up to 650,000 base pairs in length, significantly longer than previous methods. The model's capabilities open new avenues for advancing biological understanding and engineering, while also raising important biosafety and ethical considerations.