Designing realistic regulatory DNA with autoregressive language models

2024 | Avantika Lal, David Garfield, Tommaso Biancalani, Gokcen Eraslan
regLM is a framework that uses autoregressive language models, in combination with supervised sequence-to-function models, to design synthetic cis-regulatory elements (CREs) with desired properties such as high activity, low activity, or cell type-specific activity. It adapts HyenaDNA, an autoregressive foundation model that operates at single-nucleotide resolution and was trained on the human genome, for CRE generation. Functional labels are encoded as prompt tokens, and the model generates DNA sequences conditioned on those tokens. The model is trained on datasets of yeast promoter sequences and human enhancer sequences and is evaluated on its ability to generate realistic regulatory DNA.

The study demonstrates that regLM can generate synthetic yeast promoters and human enhancers that are not only predicted to have the desired functionality but also contain biological features similar to those of experimentally validated CREs, including motifs consistent with known regulatory syntax. The framework is also used to generate cell type-specific human enhancers, which are validated using multiple models. Together, the results highlight the potential of autoregressive language models for designing synthetic regulatory DNA with desired properties and provide insights into the cis-regulatory code.
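
To make the idea of encoding functional labels as prompt tokens concrete, here is a minimal sketch in Python. It is not regLM's actual code or API: the five-class quantile binning, the single-digit label tokens, and all function names (`activity_to_classes`, `build_example`, `build_vocab`, `encode`) are assumptions chosen for illustration. It only shows how measured activities might be turned into label tokens that are prefixed to a DNA sequence before language-model training.

```python
import bisect
import statistics

DNA_ALPHABET = "ACGT"


def activity_to_classes(values, n_classes=5):
    """Bin activities into quantile classes 0..n_classes-1 (assumed scheme)."""
    # statistics.quantiles returns the n_classes - 1 interior cut points.
    edges = statistics.quantiles(values, n=n_classes)
    return [bisect.bisect_right(edges, v) for v in values]


def build_example(label_classes, dna):
    """Prefix one label token per condition (e.g. per cell type) to the DNA."""
    prefix = "".join(str(c) for c in label_classes)
    return prefix + dna


def build_vocab(n_classes=5):
    """Shared vocabulary: label tokens first, then the four nucleotides."""
    tokens = [str(i) for i in range(n_classes)] + list(DNA_ALPHABET)
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(text, vocab):
    """Map each character (label digit or nucleotide) to its integer id."""
    return [vocab[ch] for ch in text]


# Hypothetical activities for ten sequences in one condition.
activities = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
classes = activities and activity_to_classes(activities)

# A sequence labeled "highest class in condition 1, lowest in condition 2".
example = build_example([4, 0], "ACGT")
token_ids = encode(example, build_vocab())
```

Under this scheme, a trained autoregressive model would be prompted at generation time with only the label tokens (for example `"40"`) and asked to continue with nucleotides, so the labels act as the conditioning signal.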