2024 | Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
The paper introduces Caduceus, a novel architecture for bi-directional, reverse complement (RC) equivariant long-range DNA sequence modeling. It addresses three challenges of modeling genomic sequences: long-range token interactions, the effects of both upstream and downstream regions, and the RC property of DNA. The architecture builds on the long-range Mamba block, which the authors extend to a BiMamba component that supports bi-directionality and to a MambaDNA block that additionally supports RC equivariance. Caduceus is the first family of RC-equivariant, bi-directional, long-range DNA language models, and it outperforms previous models on various downstream benchmarks, including a challenging variant effect prediction task. The paper also introduces pre-training and fine-tuning strategies for Caduceus foundation models, which outperform larger Transformer-based models that do not leverage bi-directionality or equivariance.
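To make the RC-equivariance property concrete, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not the Caduceus implementation: sequences are one-hot encoded with channels ordered A, C, G, T (so flipping the channel axis swaps each base with its complement), `causal_op` is a hypothetical stand-in for a left-to-right sequence operator, and `rc_equivariant` wraps it using a generic parameter-sharing construction (split the channels, run the same operator on one half directly and on the other half in the RC frame). A map `g` is RC equivariant when `g(rc(x)) == rc(g(x))`.

```python
# Toy illustration only; all names below are hypothetical, not from the paper's code.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
ALPHABET = "ACGT"  # assumed channel order: reversing it maps A<->T and C<->G


def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def one_hot(seq):
    """Encode a DNA string as a length-by-4 list of lists."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def rc(x):
    """RC on a length-by-channel array: flip the length axis and the channel axis."""
    return [row[::-1] for row in x[::-1]]


def causal_op(x):
    """Stand-in for a left-to-right sequence operator (assumption).

    Keeps a running sum per channel, scales it by a channel-dependent weight,
    and adds the first channel of the current position.
    """
    out, running = [], [0] * len(x[0])
    for row in x:
        running = [r + v for r, v in zip(running, row)]
        out.append([(c + 1) * r + row[0] for c, r in enumerate(running)])
    return out


def rc_equivariant(x, op=causal_op):
    """Wrap `op` so the result commutes with `rc` (parameter-sharing sketch)."""
    half = len(x[0]) // 2
    first = [row[:half] for row in x]   # processed in the forward frame
    second = [row[half:] for row in x]  # processed in the RC frame
    out_first = op(first)
    out_second = rc(op(rc(second)))     # same operator, then mapped back
    return [a + b for a, b in zip(out_first, out_second)]


seq = "ACGTTG"
x = one_hot(seq)
is_equivariant = rc_equivariant(rc(x)) == rc(rc_equivariant(x))
plain_op_is_equivariant = causal_op(rc(x)) == rc(causal_op(x))
```

In this sketch `is_equivariant` holds for any input with an even number of channels, while the unwrapped `causal_op` does not commute with `rc`. The equivariance comes from the wrapper's structure rather than from anything the operator has to learn.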