2009 | Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven J.M. Jones, and İnanç Birol
ABySS is a parallel assembler for short read sequence data developed to efficiently assemble large-scale sequencing projects, such as human genome sequencing. It uses a distributed de Bruijn graph approach to enable parallel computation across a network of commodity computers. The software was tested on 3.5 billion paired-end reads from the genome of an African male, resulting in 2.76 million contigs ≥100 bp in length with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and other primate genomes.
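The N50 figure quoted above can be made concrete with a small worked example. The sketch below uses the common definition of N50: the contig length at which the running total of lengths, taken from longest to shortest, first reaches half of all assembled bases. The function name `n50` and the toy contig lengths are illustrative assumptions, and the source does not state which N50 convention was used for the reported value.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 for empty input)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half
        # of the assembled bases defines the N50.
        if running >= half:
            return length
    return 0


# Made-up contig lengths in bp, not data from the paper.
contigs = [100, 200, 300, 400, 500]
print(n50(contigs))  # prints 400: 500 + 400 = 900 >= 750
```

An N50 of 1499 bp therefore means that half of the assembled bases lie in contigs of at least 1499 bp.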
Massively parallel sequencing technologies have enabled high-throughput DNA sequencing, but assembling the resulting millions or billions of short reads remains a challenge. ABySS addresses this with a distributed de Bruijn graph, which allows large data sets to be assembled efficiently. The algorithm proceeds in two stages: first, k-mers are generated from the sequence reads and read errors are removed; second, mate-pair information is used to extend contigs by resolving ambiguities in contig overlaps.
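The first stage can be illustrated with a minimal sketch, assuming the usual de Bruijn convention that two k-mers are adjacent when they overlap by k-1 bases. The helper names (`kmers`, `count_kmers`, `successors`) and the toy reads are hypothetical, reverse complements are ignored for brevity, and the error-removal step is not implemented because the summary does not describe it. The example only shows how a single read error appears as a branch in the graph.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def successors(kmer, kmer_set):
    """Return the k-mers in kmer_set that overlap kmer by k-1 bases on the right."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmer_set]


# Toy reads (assumed for illustration); the third differs from the first by one base.
reads = ["ATGGCGT", "GGCGTGC", "ATGGCTT"]
counts = count_kmers(reads, k=4)
graph_nodes = set(counts)

print(successors("ATGG", graph_nodes))  # one successor: an unambiguous extension
print(successors("TGGC", graph_nodes))  # two successors: the mismatch creates a branch
```

Walking unambiguous extensions yields contigs; branches such as the one at `TGGC` are the places where error removal or, later, mate-pair information has to decide how to proceed.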
ABySS was evaluated using simulated data sets and experimental sequence data. On simulated data, ABySS assembled 94.4% of contigs accurately, with 80% of the reference genome represented by contigs ≥100 bp. On experimental data from an African male's genome, 94.2% of contigs were correctly assembled, covering 68.2% of the human reference sequence. ABySS also identified 110,177 deletions and 101,578 insertions, some of which had not been previously identified.
ABySS was compared with other short read assemblers, showing competitive performance. It was able to accurately reconstruct the majority of the E. coli genome with contigs ≥100 bp. The software is implemented in C++ and uses MPI for communication between nodes, with a distributed de Bruijn graph representation allowing efficient scaling with genome size.
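One way to picture a distributed de Bruijn graph is to give every k-mer a deterministic owner, so that any node can compute where a k-mer lives without asking the others. The sketch below is an assumption-laden illustration, not ABySS's implementation: it simulates ranks as entries in a Python list rather than using MPI, picks SHA-1 as an arbitrary stable hash, and stores each k-mer in a canonical form (the lexicographically smaller of the k-mer and its reverse complement) so that both strands map to the same owner. The names `canonical`, `owner_rank`, and `partition` are hypothetical.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def owner_rank(kmer, num_ranks):
    """Map a k-mer to a rank using a hash that is stable across processes."""
    digest = hashlib.sha1(canonical(kmer).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_ranks


def partition(kmer_iter, num_ranks):
    """Group canonical k-mers into one set per simulated rank."""
    shards = [set() for _ in range(num_ranks)]
    for kmer in kmer_iter:
        shards[owner_rank(kmer, num_ranks)].add(canonical(kmer))
    return shards


# ATGG and CCAT are reverse complements, so they share an owner and one stored entry.
shards = partition(["ATGG", "CCAT", "TGGC"], num_ranks=3)
print(owner_rank("ATGG", 3) == owner_rank("CCAT", 3))
print(sum(len(shard) for shard in shards))
```

Because ownership depends only on the k-mer itself, memory use per node shrinks as nodes are added, which is the property that lets this kind of representation scale with genome size. A built-in `hash()` is avoided here because Python randomizes string hashes per process, which would break agreement between nodes.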
ABySS is particularly useful for sequencing organisms without a reference sequence and has the potential to improve the contiguity of assemblies by resolving repetitive regions. It provides insights into genetic variation, including novel sequences not present in the human reference genome. The software is freely available for use.