[slides] Repetitive DNA and next-generation sequencing: computational challenges and solutions

Repetitive DNA sequences are abundant in many species and cover nearly half of the human genome. They pose significant computational challenges for sequence alignment and genome assembly, especially with next-generation sequencing (NGS) technologies that produce short reads in large volumes. Repeats create ambiguities in alignment and assembly, which can lead to errors in interpreting results. Ignoring repeats is not an option, because doing so may miss important biological information. Repeats can be short or long and occur in various forms, such as tandem repeats or interspersed repeats. They play roles in evolution and can be functional or non-functional.

Current bioinformatics tools address these challenges with explicit strategies for handling repeats during alignment and assembly. For example, alignment tools like Bowtie and BWA align reads efficiently, while assembly tools reconstruct genomes from graphs built over the reads.

In genome resequencing, reads are mapped to a reference genome, and tools like GATK and SAMtools are used to detect SNPs and structural variants. However, multi-reads that align to multiple locations can lead to errors if not handled properly. Strategies such as reporting only the best match or reporting all alignments help manage this ambiguity.

In de novo genome assembly, repeats cause gaps and misassemblies, making it difficult to reconstruct the genome accurately. Short read lengths and high repeat content make assembly more challenging. Strategies such as using mate-pair libraries and statistical methods help overcome these challenges.

For RNA-seq analysis, spliced alignment is necessary because reads can span exon junctions where introns have been removed. Tools like TopHat and MapSplice align reads across splice junctions. However, multi-reads can introduce errors in gene expression estimation, and tools like ERANGE and RSEM help improve accuracy.
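
To make the multi-read problem in expression estimation concrete, the sketch below splits each multi-read across its candidate genes in proportion to the uniquely mapped reads those genes already have. It is a simplified illustration of proportional allocation, not the actual ERANGE or RSEM implementation; the function name `allocate_multireads`, the gene names, and the counts are all hypothetical.

```python
from collections import defaultdict


def allocate_multireads(unique_counts, multireads):
    """Split each multi-read across its candidate genes.

    unique_counts: dict mapping gene -> number of uniquely mapped reads.
    multireads: list of tuples, each holding the genes one multi-read aligns to.
    Returns a dict mapping gene -> estimated count (unique + fractional shares).
    """
    estimates = defaultdict(float)
    for gene, count in unique_counts.items():
        estimates[gene] += count

    for candidates in multireads:
        weights = [unique_counts.get(gene, 0) for gene in candidates]
        total = sum(weights)
        for gene, weight in zip(candidates, weights):
            # With no unique evidence for any candidate, fall back to an even split.
            share = weight / total if total > 0 else 1 / len(candidates)
            estimates[gene] += share
    return dict(estimates)


# Hypothetical toy data: two paralogs share sequence, so 10 reads map to both.
unique_counts = {"geneA": 90, "geneB": 10}
multireads = [("geneA", "geneB")] * 10
estimates = allocate_multireads(unique_counts, multireads)
print(estimates)
```

With these toy numbers, geneA ends up with roughly 99 reads and geneB with roughly 11, whereas discarding the multi-reads would undercount both genes and assigning them all to one gene would inflate it.
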
Overall, computational tools and strategies are essential for handling repetitive DNA in NGS data. Advances in sequencing technology and computational methods continue to improve the ability to analyze and assemble genomes and transcriptomes accurately.
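
The assembly ambiguity described in the summary can be seen in a small graph. The sketch below builds a de Bruijn graph, one common graph representation used by short-read assemblers, and lists the nodes with more than one successor. The toy genome, the read length, and the helper names `de_bruijn_graph` and `branching_nodes` are assumptions made for illustration; this is not the implementation of any particular assembler.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor, where the path is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical toy genome in which the 3-mer "CGT" occurs twice,
# followed by a different base each time.
genome = "ATGCGTACCGTTA"
# Error-free reads of length 6 tiling the genome (an idealized assumption).
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]

graph = de_bruijn_graph(reads, k=4)
print(branching_nodes(graph))
```

Both copies of the repeat collapse into the single node `CGT`, which then has two outgoing edges. Reads shorter than the repeat cannot say which exit follows which entry, which is why longer reads or mate-pair links are needed to resolve such nodes.
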

Repetitive DNA and next-generation sequencing: computational challenges and solutions

2012 | Todd J. Treangen and Steven L. Salzberg