January 12, 2018 | Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, and Pavel A. Pevzner
The paper introduces Flye, an algorithm that assembles long error-prone reads into a repeat graph, a more accurate representation of genome repeats than the traditional de Bruijn graph. Unlike existing assemblers, which try to generate accurate contigs from the start, Flye first generates arbitrary paths in the assembly graph and then refines them to produce accurate contigs. This approach yields the same graph as the assembly graph constructed from accurate contigs. Flye resolves unbridged repeats (repeats not spanned by any read) by using small variations between repeat copies, and constructs a new, less tangled assembly graph. It is benchmarked against several state-of-the-art single-molecule sequencing assemblers and produces better or comparable assemblies on all analyzed datasets.
Genome assembly is closely tied to characterizing the repeats in a genome. Long-read technologies have shifted the focus from short repeats to longer repeats whose length is comparable to the median single-molecule sequencing (SMS) read size. The de Bruijn graph, while popular for short-read assembly, is poorly suited to long-read assembly because the graph built from error-prone reads is noisy and fails to collapse repeat instances into a single path. Alternative approaches, such as the overlap-layout-consensus (OLC) method, have been used instead but are not optimal for repeat resolution. Flye uses a repeat graph to represent all repeats in a genome and resolves unbridged repeats using variations between repeat copies.
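To make the repeat-collapsing property concrete, here is a minimal sketch of a de Bruijn graph on a toy sequence. It is not taken from the paper: the sequences, the value of k, and the helper names `de_bruijn_edges` and `collapsed_edges` are assumptions chosen for illustration, and a single substitution in one repeat copy stands in for sequencing noise.

```python
from collections import Counter

def de_bruijn_edges(seq, k):
    """Count the k-mers of seq; each k-mer is an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    return Counter(
        (seq[i:i + k - 1], seq[i + 1:i + k])
        for i in range(len(seq) - k + 1)
    )

def collapsed_edges(edges):
    """Return the k-mers whose edge is traversed more than once,
    i.e. the places where separate copies share a single path."""
    return sorted(u + v[-1] for (u, v), n in edges.items() if n > 1)

# Toy data (hypothetical): one 8-bp repeat placed twice between unique flanks.
REPEAT = "ACGTACCA"
genome = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CTAGT"

# With exact sequence, every k-mer of the repeat collapses onto one path.
exact = collapsed_edges(de_bruijn_edges(genome, 4))

# A single substitution inside the second copy (position 21, T -> G)
# breaks most of the shared path apart.
noisy_genome = genome[:21] + "G" + genome[22:]
noisy = collapsed_edges(de_bruijn_edges(noisy_genome, 4))

print(exact)  # ['ACCA', 'ACGT', 'CGTA', 'GTAC', 'TACC']
print(noisy)  # ['ACCA']
```

With error-free sequence the two repeat copies share one path of five edges. After a single mismatch only one shared edge survives, which illustrates why a de Bruijn graph built directly from noisy long reads does not collapse repeats cleanly.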
Flye constructs its assembly graph from polished contigs rather than from error-prone reads, which makes the graph more accurate. It also complements the HINGE assembler by introducing an algorithm that uses small differences between repeat copies to resolve unbridged repeats, those not spanned by any read. Starting from the repeat graph, Flye constructs an assembly graph, refines it by resolving the unbridged repeats, and outputs accurate contigs.
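The idea of telling repeat copies apart by their small differences can be illustrated with a simplified sketch. This is not Flye's actual procedure: it assumes the two copy sequences are already known and that each read comes with a gap-free alignment offset, whereas a real assembler has to infer both from the reads. The names `divergent_positions` and `assign_read` and all sequences are hypothetical.

```python
def divergent_positions(copy_a, copy_b):
    """Positions at which two equally long repeat copies differ."""
    return [i for i, (a, b) in enumerate(zip(copy_a, copy_b)) if a != b]

def assign_read(read, offset, copy_a, copy_b):
    """Assign a read aligned at `offset` within the repeat to copy 'A' or 'B'
    by voting at the divergent positions it covers.
    Returns None if the read covers no divergent position or the vote ties."""
    votes_a = votes_b = 0
    for pos in divergent_positions(copy_a, copy_b):
        j = pos - offset
        if 0 <= j < len(read):
            votes_a += read[j] == copy_a[pos]
            votes_b += read[j] == copy_b[pos]
    if votes_a == votes_b:
        return None
    return "A" if votes_a > votes_b else "B"

# Toy repeat copies (hypothetical) that differ at positions 3 and 12.
copy_a = "ACGTACGTTAGCCATG"
copy_b = "ACGAACGTTAGCTATG"

# Reads lying entirely inside the repeat, so no read bridges it.
r1 = assign_read("ACGTACGT", 0, copy_a, copy_b)  # matches copy A at position 3
r2 = assign_read("TAGCTATG", 8, copy_a, copy_b)  # matches copy B at position 12
r3 = assign_read("CGTTAG", 5, copy_a, copy_b)    # covers no divergent position
r4 = assign_read("ACGAACCT", 0, copy_a, copy_b)  # copy B plus one unrelated error

print(r1, r2, r3, r4)  # A B None B
```

Reads that cover at least one divergent position can be attributed to a copy even though none of them spans the whole repeat, while reads that cover none remain ambiguous. An error at a non-divergent position, as in the last read, does not change the assignment.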
The paper reports results on the BACTERIA, YEAST, WORM, HUMAN, and HUMAN+ datasets. Flye's assemblies are as accurate and contiguous as those of other assemblers or better, with the clearest gains on complex datasets such as YEAST and WORM. Flye is freely available and is described in detail in the paper. The paper also discusses the importance of repeat characterization in genome assembly and how Flye's approach provides a useful framework for planning additional experiments to finish genome assemblies.