Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

2024 | Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
Caduceus is a bi-directional DNA language model that supports reverse complement (RC) equivariance. It builds on the Mamba block, which is extended to BiMamba for bi-directionality and to MambaDNA for RC equivariance. BiMamba applies the Mamba module twice, once to the original sequence and once to a reversed copy, and shares projection weights between the two passes to reduce the parameter count. MambaDNA adds RC equivariance by applying the same module, with shared parameters, to both a sequence and its reverse complement, so that the two strands of the same double-stranded DNA are handled consistently. The authors describe Caduceus as the first family of RC-equivariant DNA foundation models.
Caduceus comes in two versions: Caduceus-PS, which uses parameter sharing for RC equivariance, and Caduceus-Ph, which uses post-hoc conjoining during downstream task inference. The models are pre-trained and then fine-tuned for downstream tasks. One such task is variant effect prediction (VEP), which asks whether a genetic mutation influences a phenotype. Pre-training implicitly teaches the model to recognize the effects of evolutionary pressure, which is a key source of signal for VEP.
The model is evaluated on several downstream tasks, including genomic benchmarks, nucleotide transformer tasks, and predicting the effect of variants on gene expression. It outperforms previous long-range models on these benchmarks, particularly on a challenging long-range variant effect prediction task, where it exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance. Across these tasks it is reported to outperform other models, including HyenaDNA and Nucleotide Transformer v2.
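The two wrappers described above can be illustrated with a small pure-Python sketch. This is not the Caduceus code: `toy_module` is a hypothetical stand-in for a Mamba block (a simple causal scan), the helper names `rc`, `bidirectional`, and `rc_equivariant` are introduced here for illustration, and the channel-splitting form of the equivariant wrapper is an assumption about one way to share parameters across strands. The sketch only checks the defining property, namely that applying RC before the layer gives the same result as applying it after.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
ALPHABET = "ACGT"  # channel order chosen so flipping channels swaps A<->T and C<->G


def reverse_complement(seq):
    """Reverse the string and complement each base."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def one_hot(seq):
    """Encode a DNA string as a T x 4 list of rows."""
    return [[1.0 if b == c else 0.0 for c in ALPHABET] for b in seq]


def rc(x):
    """RC on a T x D array: flip the length axis and the channel axis."""
    return [row[::-1] for row in x[::-1]]


def toy_module(x, decays=(0.5, 0.9)):
    """Hypothetical stand-in for a Mamba block: a causal scan
    h_t = a * h_{t-1} + x_t with one decay per channel (T x 2 -> T x 2)."""
    state = [0.0] * len(decays)
    out = []
    for row in x:
        state = [a * s + v for a, s, v in zip(decays, state, row)]
        out.append(state)
    return out


def bidirectional(x, module=toy_module):
    """Run the module on x and on x reversed along length, flip the second
    output back, and add. For simplicity every parameter is shared here."""
    fwd = module(x)
    bwd = module(x[::-1])[::-1]
    return [[f + b for f, b in zip(fr, br)] for fr, br in zip(fwd, bwd)]


def rc_equivariant(x, module=toy_module):
    """Split channels in half, apply the shared module to the first half and
    to the RC of the second half, undo the RC, then concatenate channels."""
    half = len(x[0]) // 2
    first = [row[:half] for row in x]
    second = [row[half:] for row in x]
    out_first = module(first)
    out_second = rc(module(rc(second)))
    return [a + b for a, b in zip(out_first, out_second)]


def close(a, b, tol=1e-9):
    """Element-wise comparison of two T x D arrays."""
    return all(abs(u - v) <= tol for ra, rb in zip(a, b) for u, v in zip(ra, rb))


x = one_hot("AACGTGT")
# RC on the one-hot array matches RC on the string.
print(rc(x) == one_hot(reverse_complement("AACGTGT")))
# Equivariance: layer(RC(x)) equals RC(layer(x)), also with the bidirectional module inside.
print(close(rc_equivariant(rc(x)), rc(rc_equivariant(x))))
print(close(rc_equivariant(rc(x), module=bidirectional),
            rc(rc_equivariant(x, module=bidirectional))))
```

The equivariance check does not depend on what the inner module computes, only on it preserving the input shape, which is why the same wrapper can enclose either the plain scan or the bidirectional version.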
The results show that Caduceus is effective in long-range DNA sequence modeling and can be used for a wide range of genomics tasks.
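Post-hoc conjoining, the strategy attributed to Caduceus-Ph, can be sketched in a few lines under the assumption that it means scoring both a sequence and its reverse complement at inference time and averaging the two predictions. The scorer `toy_score` below is a hypothetical, deliberately strand-sensitive stand-in for a fine-tuned model, and `conjoined_score` is a name introduced here for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the string and complement each base."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def toy_score(seq):
    """Hypothetical strand-sensitive scorer: position-weighted count of 'G'.
    Stands in for a fine-tuned sequence-level model."""
    return sum((i + 1) for i, b in enumerate(seq) if b == "G") / len(seq)


def conjoined_score(seq, model=toy_score):
    """Average the model's prediction on the sequence and on its reverse complement."""
    return 0.5 * (model(seq) + model(reverse_complement(seq)))


seq = "AACGTGT"
rc_seq = reverse_complement(seq)
# The raw scorer gives different values for the two strands ...
print(toy_score(seq), toy_score(rc_seq))
# ... while the conjoined prediction is the same for both.
print(conjoined_score(seq), conjoined_score(rc_seq))
```

Averaging makes a sequence-level prediction invariant to strand choice without changing the model itself, at the cost of two forward passes per input.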