Vol. 85, pp. 2444-2448, April 1988 | WILLIAM R. PEARSON* AND DAVID J. LIPMAN†
The article presents three computer programs (FASTA, RDF2, and LFASTA) for comparing protein and DNA sequences. FASTA is a more sensitive derivative of FASTP. It can search protein or DNA sequence databases, and it can compare a protein sequence to DNA sequences by translating the DNA. FASTA gains its sensitivity by allowing multiple regions of similarity to be joined. RDF2 evaluates the significance of similarity scores using a shuffling method. LFASTA identifies local similarities between sequences and can display them as a "graphic matrix" plot or as individual alignments. The programs are efficient and run on several computer systems, including Unix, VAX/VMS, and IBM PC DOS.
The method proceeds in four steps: (1) identify regions of identity shared by the two sequences, (2) rescore those regions with a scoring matrix, (3) optimize the initial regions, and (4) align the sequences. In the optimization step, FASTA tries to join initial regions to improve the similarity score, while LFASTA computes a local alignment for each initial region. The programs use scoring matrices such as PAM250 and allow users to specify different matrices.
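The first two steps can be illustrated with a minimal Python sketch. It is not the authors' C implementation: the function names, the `ktup` and `n` parameters, and the simple hit count per diagonal are assumptions made for illustration, and the `score` callback stands in for a real matrix such as PAM250.

```python
from collections import defaultdict

def ktup_diagonal_hits(query, library, ktup=2):
    """Count exact k-tuple matches on each diagonal (diagonal = i - j)."""
    # Lookup table: every k-tuple in the query -> list of start positions.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)

    # Scan the library sequence once; each shared k-tuple votes for a diagonal.
    hits = defaultdict(int)
    for j in range(len(library) - ktup + 1):
        for i in table.get(library[j:j + ktup], ()):
            hits[i - j] += 1
    return dict(hits)

def best_diagonals(hits, n=10):
    """Return up to n diagonals with the most k-tuple hits, best first."""
    return sorted(hits, key=lambda d: (-hits[d], d))[:n]

def rescore_diagonal(query, library, diag, score):
    """Best-scoring ungapped segment on one diagonal (maximum-subarray scan)."""
    best = run = 0
    i, j = max(diag, 0), max(-diag, 0)  # first cell of the diagonal
    while i < len(query) and j < len(library):
        # Drop the running score back to zero when it goes negative.
        run = max(0, run + score(query[i], library[j]))
        best = max(best, run)
        i += 1
        j += 1
    return best

# Toy scoring function (assumption): +1 for identity, -1 otherwise.
def toy_score(a, b):
    return 1 if a == b else -1

hits = ktup_diagonal_hits("ACDEFGHIK", "XXACDEFGYY")
top = best_diagonals(hits)
init_score = rescore_diagonal("ACDEFGHIK", "XXACDEFGYY", top[0], toy_score)
```

Counting hits per diagonal is a simplification of the region-finding step; the point is only that a lookup table lets the library sequence be scanned once, and that rescoring keeps the best-scoring stretch of each candidate diagonal.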
The article discusses the balance between sensitivity and selectivity in sequence comparison. FASTA improves sensitivity with a small loss of selectivity. Local similarity analyses are important for detecting subsequences in longer sequences. LFASTA is designed for local similarity and can display results as alignments or graphic plots.
The programs also evaluate statistical significance using shuffling methods. RDF2 calculates scores for shuffled sequences and compares them to the original. Local shuffling is more stringent than global shuffling. The programs are implemented in C and run on Unix, VAX/VMS, and IBM PC DOS systems.
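The shuffling test can be sketched as follows. This is an illustration under stated assumptions, not RDF2 itself: the function names, the window size of 10, the number of shuffles, and the `toy_score` scorer (longest common substring) are all hypothetical stand-ins. The reported value is the observed score expressed in standard deviations above the mean of the shuffled scores.

```python
import math
import random
import statistics

def global_shuffle(seq, rng):
    """Shuffle the whole sequence, preserving overall composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def local_shuffle(seq, rng, window=10):
    """Shuffle within consecutive windows, preserving local composition."""
    out = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        out.extend(block)
    return "".join(out)

def toy_score(a, b):
    """Stand-in scorer (assumption): length of the longest common substring."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def shuffle_z_score(query, target, score_fn, shuffle_fn, n=100, seed=0):
    """Observed score in standard deviations above the shuffled mean."""
    rng = random.Random(seed)
    observed = score_fn(query, target)
    shuffled = [score_fn(query, shuffle_fn(target, rng)) for _ in range(n)]
    mean = statistics.mean(shuffled)
    sd = statistics.stdev(shuffled)
    if sd == 0:
        # Degenerate case: every shuffled score was identical.
        if observed == mean:
            return 0.0
        return math.copysign(math.inf, observed - mean)
    return (observed - mean) / sd
```

Passing `local_shuffle` instead of `global_shuffle` keeps each window's composition intact, so locally biased composition contributes to the shuffled scores as well; this is one way to see why the local shuffle is the more stringent test.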
Examples show FASTA's superiority in scoring related sequences. FASTA and LFASTA can search DNA databases by translating sequences into reading frames. The programs are flexible, allowing different scoring matrices and parameters. The discussion emphasizes the importance of evaluating similarity scores and the need for careful empirical evaluation of algorithms. The programs provide a consistent measure for scoring similarity and constructing alignments, enabling further analysis of related sequences.
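Translating a DNA sequence into reading frames can be sketched as below. The sketch assumes the standard genetic code and all six frames (three on the given strand, three on its reverse complement); the function names are hypothetical, and unknown or ambiguous codons are rendered as "X" by choice.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna, frame=0):
    """Translate complete codons starting at offset `frame` (0, 1, or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def six_frames(dna):
    """Return frames 0-2 of the forward strand, then 0-2 of the reverse."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    return [translate(s, f) for s in (dna, rc) for f in range(3)]

frames = six_frames("ATGGCCAAATAA")
```

Each translated frame can then be compared to a protein query with the same scoring matrix used for protein-to-protein comparison, which is what makes a protein search against a DNA database possible.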