Rapid similarity searches of nucleic acid and protein data banks

The paper presents an efficient algorithm for global similarity searches of nucleic acid and protein sequence databases. It compares sequences by matching k-tuples of sequence elements, which reduces search time while maintaining sensitivity. A separate adaptation produces rigorous sequence alignments. Implemented on a DEC KL-10 system, the method compares a 350-residue query against all sequences in the Protein Data Bank in less than 3 minutes, and a 500-base query against the eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 minutes.

The algorithm identifies k-tuple matches between two sequences and keeps only those that fall in "window space," defined as the regions around significant diagonals of the dot matrix comparison. Restricting the work to this space reduces computation time compared with traditional methods such as Needleman-Wunsch, which require O(N×M) time. The resulting alignments are optimal under specific scoring rules, with scores based on k-tuple matches in window space. Tests on a variety of sequence comparisons show good agreement with Needleman-Wunsch results.

The parameter w, which defines the window size, strongly affects both computation time and alignment quality: larger values of w improve alignment quality but increase computation time, while smaller values reduce computation time but may lower alignment quality.

The algorithm detects both strong and weak similarities between sequences, and its speed makes it particularly suitable for searching large databases and for large-scale sequence comparisons. It may not, however, achieve the same resolution as the full Needleman-Wunsch or Sellers algorithms, and its results should be interpreted in biological context. The algorithm has also been adapted to produce local best alignments, which are more useful for dealing with the inhomogeneity of nucleic acids.
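
The k-tuple and diagonal idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`ktuple_index`, `diagonal_match_counts`, `window_space`) are hypothetical, and picking the `top` highest-count diagonals is a simplifying assumption standing in for whatever criterion the paper uses to call a diagonal significant.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_match_counts(query, target, k):
    """Count k-tuple matches on each diagonal d = i - j of the dot matrix,
    where i is a position in query and j a position in target."""
    index = ktuple_index(query, k)
    counts = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            counts[i - j] += 1
    return dict(counts)


def window_space(counts, w, top=1):
    """Diagonals within w of the `top` highest-count diagonals.
    Assumption: 'significant' is approximated here by highest match count."""
    best = sorted(counts, key=counts.get, reverse=True)[:top]
    return {d for b in best for d in range(b - w, b + w + 1)}


counts = diagonal_match_counts("ACGTACGT", "TTACGTAC", k=3)
print(counts)
print(window_space(counts, w=1))
```

Because the query is indexed once, each target sequence is scanned in a single pass, and only matching k-tuples contribute work. This is the intuition behind the speedup over filling a full N×M matrix.
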
The method's efficiency and effectiveness in sequence comparison make it a valuable tool for bioinformatics. The results highlight the importance of balancing computation time and alignment quality in sequence analysis.
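
The trade-off governed by the window parameter w can be made concrete with a small sketch. The code below is a generic banded Needleman-Wunsch scorer, swapped in for illustration only. It does not use the paper's k-tuple-based scoring rules, and the function name `banded_global_score`, the `center` diagonal argument, and the match/mismatch/gap values are all assumptions. It fills only cells whose diagonal i - j lies within w of `center` and reports how many cells were computed.

```python
def banded_global_score(a, b, center, w, match=1, mismatch=-1, gap=-1):
    """Global alignment score restricted to diagonals |(i - j) - center| <= w.

    Returns (score, cells_filled). score is None when the band cannot
    connect the start corner (0, 0) to the end corner (len(a), len(b)).
    Illustrative scoring only; not the scoring rules of the paper.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    prev = {}  # scores for row i - 1, keyed by column j
    cells = 0
    for i in range(n + 1):
        cur = {}
        lo = max(0, i - center - w)
        hi = min(m, i - center + w)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                best = 0
            else:
                best = neg_inf
                if i > 0 and j > 0 and (j - 1) in prev:
                    step = match if a[i - 1] == b[j - 1] else mismatch
                    best = max(best, prev[j - 1] + step)  # diagonal move
                if i > 0 and j in prev:
                    best = max(best, prev[j] + gap)       # gap in b
                if j > 0 and (j - 1) in cur:
                    best = max(best, cur[j - 1] + gap)    # gap in a
            cur[j] = best
            cells += 1
        prev = cur
    score = prev.get(m, neg_inf)
    return (None if score == neg_inf else score), cells


# One deleted base: w = 0 cannot reach the end corner, w = 1 can.
print(banded_global_score("ACGTACGT", "ACGACGT", center=0, w=0))
print(banded_global_score("ACGTACGT", "ACGACGT", center=0, w=1))
```

A narrow band fills far fewer cells than the full matrix but can miss alignments that need more gaps than the band allows, which mirrors the summary's point that smaller w is faster but may lower alignment quality.
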

February 1983 | W. J. Wilbur and David J. Lipman