SSAHA: A Fast Search Method for Large DNA Databases

April 26, 2001 | Zemin Ning, Anthony J. Cox, and James C. Mullikin
SSAHA is a fast search method for large DNA databases. It preprocesses the database sequences by breaking them into k-tuples (words of k consecutive bases) and storing the position of each k-tuple in a hash table. A query is then searched by retrieving the hits for each k-tuple in the query sequence and sorting the results. The choice of k affects search speed, memory usage, and sensitivity. Computational experiments show SSAHA is three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. SSAHA is used for SNP detection and large-scale sequence assembly, and provides Web-based search facilities for Ensembl projects.

The hash table uses a 2-bit encoding for nucleotides. It is constructed in two passes through the subject data: the first counts the occurrences of each non-overlapping k-tuple so that memory can be allocated for the list of positions, and the second fills in that list. Words whose frequency exceeds a cutoff threshold N are excluded, which reduces the hash table size and filters out spurious matches. Because lookups use hashing, search time is independent of database size as long as k is chosen to keep W/4^k small. The cost is that SSAHA requires large amounts of RAM to store the hash table.

The search retrieves the hits for each k-tuple in the query sequence, sorts them, and identifies runs of hits that indicate potential matches. Performance can be tuned by varying k and the cutoff threshold N. Sensitivity is limited by the number of consecutive matching bases needed for a hit: 2k-1 matching bases are required to guarantee one. Modifications that allow substitutions or hash the database base by base can increase sensitivity. SSAHA has been implemented for various applications, including SNP detection and sequence assembly.
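The construction and search steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SSAHA implementation: a Python dictionary stands in for the hash table, and the names `encode`, `build_table`, `search`, and the `min_hits` parameter are hypothetical. Grouping hits by sequence index and shift (subject offset minus query offset) is one straightforward way to find the runs of hits the text mentions.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit encoding, one code per base


def encode(word):
    """Pack a k-tuple into an integer, 2 bits per base (None if not pure ACGT)."""
    value = 0
    for base in word:
        if base not in CODE:
            return None
        value = (value << 2) | CODE[base]
    return value


def build_table(subjects, k, cutoff=None):
    """Two passes over the subject data, using non-overlapping k-tuples."""
    # Pass 1: count how often each word occurs.
    counts = defaultdict(int)
    for seq in subjects:
        for pos in range(0, len(seq) - k + 1, k):
            key = encode(seq[pos:pos + k])
            if key is not None:
                counts[key] += 1
    # Pass 2: record (sequence index, offset), skipping words above the cutoff.
    table = defaultdict(list)
    for idx, seq in enumerate(subjects):
        for pos in range(0, len(seq) - k + 1, k):
            key = encode(seq[pos:pos + k])
            if key is not None and (cutoff is None or counts[key] <= cutoff):
                table[key].append((idx, pos))
    return table


def search(table, query, k, min_hits=2):
    """Collect hits for every overlapping query k-tuple, sort, and find runs."""
    hits = []
    for t in range(len(query) - k + 1):
        key = encode(query[t:t + k])
        if key is None:
            continue
        for idx, pos in table.get(key, ()):
            hits.append((idx, pos - t, pos))  # (index, shift, offset)
    hits.sort()
    # A run is a group of sorted hits sharing the same index and shift.
    runs, start = [], 0
    for i in range(1, len(hits) + 1):
        if i == len(hits) or hits[i][:2] != hits[start][:2]:
            if i - start >= min_hits:
                idx, shift = hits[start][:2]
                runs.append((idx, shift, [h[2] for h in hits[start:i]]))
            start = i
    return runs


table = build_table(["TTGACCATGGCA", "ACGTACGTACGT"], k=3)
print(search(table, "GACCATGG", k=3))  # [(0, 2, [3, 6])]
```

In the usage example the query aligns with subject 0 at shift 2, and the two hits at offsets 3 and 6 are the hashed words `ACC` and `ATG`. Passing a `cutoff` value drops any word that occurs more often than that, as in the frequency filter described above.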
SSAHA has been adapted to process genomic reads and detect single nucleotide polymorphisms (SNPs). The SSAHA library is used for building applications and has been ported to multiple platforms. The algorithm is efficient for large databases and is suited to applications requiring "almost exact" matches, such as genome assembly, contig ordering, and SNP detection.
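The "almost exact" qualifier follows from the 2k-1 figure: because only non-overlapping k-tuples of the subject are hashed, a matching region produces a hit only if it fully contains one of those words. The short check below illustrates the arithmetic. It assumes hashed words start at subject offsets 0, k, 2k, and so on, and the function names are hypothetical.

```python
def contains_hashed_word(start, length, k):
    """True if the matching region [start, start + length) fully covers
    a subject word beginning at a multiple of k."""
    first_aligned = -(-start // k) * k  # smallest multiple of k >= start
    return first_aligned + k <= start + length


def min_guaranteed_length(k):
    """Shortest match length that yields a hit wherever the match begins."""
    length = k
    # Only the start position modulo k matters, so checking 0..k-1 suffices.
    while not all(contains_hashed_word(s, length, k) for s in range(k)):
        length += 1
    return length


for k in (4, 10, 12):
    print(k, min_guaranteed_length(k))  # prints 2k-1 for each k: 7, 19, 23
```

A match of 2k-2 bases that starts one base after a word boundary covers no complete hashed word, which is why shorter matches can be missed and why hashing the database base by base raises sensitivity.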