SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models

March 27, 2024 | Bernardo P. de Almeida, Hugo Dalla-Torre, Guillaume Richard, Christopher Blum, Lorenz Hexemer, Maxence Gérald, Javier Mendoza-Revilla, Priyanka Pandey, Stefan Laurent, Marie Lopez, Alexandre Laterre, Maren Lang, Uğur Şahin, Karim Beguir, Thomas Pierrot
SegmentNT is a DNA segmentation model that combines a pre-trained DNA foundation model, Nucleotide Transformer (NT), with a 1D U-Net architecture to predict 14 types of genomic elements at single-nucleotide resolution. It processes DNA sequences up to 30 kb in length and identifies genomic elements, including splice sites, exons, introns, and regulatory regions, with high accuracy. SegmentNT outperforms other models, including convolutional networks and models trained from scratch, and generalizes to sequences up to 50 kb. Its performance is particularly strong for splice sites and exons, which are highly conserved. It also performs well in predicting the impact of sequence variants on gene structure and function. SegmentNT can be extended to additional genomic elements and to other species, including plants, and it generalizes better when trained on multiple species. Its architecture allows for efficient inference, and the model is available on GitHub and HuggingFace. Overall, SegmentNT provides strong evidence that DNA foundation models can tackle complex tasks in genomics, including the prediction of regulatory elements and gene structure, enabling precise annotation and interpretation of DNA sequences at single-nucleotide resolution.
SegmentNT's results demonstrate the effectiveness of pre-trained DNA foundation models in genomics, offering a new paradigm for analyzing and interpreting DNA sequences.
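To make "single-nucleotide resolution" concrete, here is a minimal sketch of how per-position predictions could be turned into annotated intervals. It assumes the model emits, for each element type, one probability per nucleotide. That output format, the 0.5 threshold, the toy values, and the function names `probabilities_to_segments` and `annotate` are all illustrative assumptions, not part of the released SegmentNT code, and the sketch does not call the model itself.

```python
from typing import Dict, List, Tuple


def probabilities_to_segments(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Merge runs of positions with probability >= threshold
    into half-open (start, end) intervals."""
    segments: List[Tuple[int, int]] = []
    start = None  # start index of the run currently being tracked
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a new run begins here
        elif p < threshold and start is not None:
            segments.append((start, i))    # the run ended at i - 1
            start = None
    if start is not None:                  # run extends to the sequence end
        segments.append((start, len(probs)))
    return segments


def annotate(
    track_probs: Dict[str, List[float]], threshold: float = 0.5
) -> Dict[str, List[Tuple[int, int]]]:
    """Apply the same thresholding independently to every element type.
    Tracks are treated as independent labels (an assumption), so
    intervals of different types may overlap."""
    return {
        name: probabilities_to_segments(p, threshold)
        for name, p in track_probs.items()
    }


# Toy input: two hypothetical tracks over an 8-nucleotide sequence.
toy = {
    "exon":   [0.1, 0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.9],
    "intron": [0.0, 0.1, 0.1, 0.2, 0.9, 0.8, 0.3, 0.1],
}
annotations = annotate(toy)
print(annotations)
```

Each track is thresholded on its own, so one nucleotide can carry several labels at once; whether that matches the model's actual output head should be checked against the released code.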