Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

This paper presents a method for identifying DNA and protein patterns from statistically significant alignments of multiple sequences. The authors divide the problem into four components. First, they describe a log-likelihood scoring scheme called information content. Second, they describe two methods for estimating the P value of an individual information content score: one combining large-deviation statistics with numerical calculations, and another that is purely numerical. Third, they describe how to count the number of possible alignments given the sequence data; this count is multiplied by the P value to give the expected frequency of an information content score, and thus its statistical significance. Fourth, they describe a greedy algorithm for determining alignments of functionally related sequences. The authors also test the accuracy of their P value calculations and give an example in which the algorithm identifies binding sites for the Escherichia coli CRP protein.
The paper stresses that statistical significance is what allows alignments with different widths and numbers of sequences to be compared. The large-deviation technique for approximating P values requires the moment-generating function of the statistic and its derivatives, and the authors describe a dynamic programming algorithm that calculates these functions efficiently for DNA alignments. The alternative, purely numerical method transforms the statistic into integer values and builds a table of P values for it. The paper also covers the counting of possible alignments, the use of expected frequency to compare alignments, and the importance of considering the independence of alignments when calculating statistical significance.
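
The scoring, P value, and expected-frequency steps summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard relative-entropy form of information content with natural logarithms, an integer scaling factor (`scale`) chosen for illustration, and the simplest counting case in which every sequence contributes exactly one site on one strand. All function names are hypothetical.

```python
import math
from collections import Counter


def column_score(counts, n, background):
    """N times the information content of one column: sum_i c_i * ln(c_i / (n * p_i))."""
    return sum(counts[a] * math.log(counts[a] / (n * background[a]))
               for a in sorted(counts) if counts[a] > 0)


def information_content(sites, background):
    """Information content of an alignment of equal-length sites (summed over columns)."""
    n = len(sites)
    return sum(column_score(Counter(col), n, background) for col in zip(*sites)) / n


def compositions(n, k):
    """Yield every way to split n sequences among k letters."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def column_table(n, background, scale):
    """Exact distribution of the integer-rounded column score under the background model."""
    letters = sorted(background)
    table = {}
    for counts in compositions(n, len(letters)):
        prob = float(math.factorial(n))  # multinomial probability of this composition
        for a, c in zip(letters, counts):
            prob *= background[a] ** c / math.factorial(c)
        key = round(scale * column_score(dict(zip(letters, counts)), n, background))
        table[key] = table.get(key, 0.0) + prob
    return table


def p_value(sites, background, scale=100):
    """P(score >= observed), found by convolving the column table once per column."""
    n = len(sites)
    col = column_table(n, background, scale)
    dist = {0: 1.0}
    for _ in range(len(sites[0])):  # columns are treated as independent
        nxt = {}
        for s1, p1 in dist.items():
            for s2, p2 in col.items():
                nxt[s1 + s2] = nxt.get(s1 + s2, 0.0) + p1 * p2
        dist = nxt
    observed = sum(round(scale * column_score(Counter(c), n, background))
                   for c in zip(*sites))
    return sum(p for s, p in dist.items() if s >= observed)


def expected_frequency(p, seq_lengths, width):
    """Number of possible alignments (one site per sequence, assumed) times the P value."""
    count = 1
    for length in seq_lengths:
        count *= length - width + 1
    return count * p


if __name__ == "__main__":
    bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    sites = ["AC", "AC", "AC", "AC"]
    p = p_value(sites, bg)
    print(information_content(sites, bg), p, expected_frequency(p, [10] * 4, 2))
```

For four identical two-base sites and a uniform background, each column is fully conserved with probability 4 × (1/4)^4 = 1/64, so the P value is (1/64)^2 = 1/4096. Multiplying by the 9^4 possible alignments of four length-10 sequences gives an expected frequency above 1, so such an alignment would not be considered significant. Rounding to integers makes the table approach approximate; a larger `scale` gives finer resolution at the cost of a larger table.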
The authors describe their greedy algorithm for determining alignments with the highest information content, which is order-independent and can be used to identify functional relationships between sequences. The algorithm is implemented in a program called CONSENSUS, which is available for download. The paper concludes with a discussion of the limitations of the methods and the importance of considering the statistical significance of alignments in sequence analysis.
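
The greedy search described above can be sketched as a simple beam search. This is an illustrative simplification, not the CONSENSUS program. It assumes a fixed site width, one site per sequence, a single strand, and a hypothetical `keep` parameter for the number of partial alignments retained per cycle. Seeding from every sequence, rather than from the first one only, is what keeps this sketch from depending on input order, apart from ties.

```python
import math
from collections import Counter


def information_content(sites, background):
    """Sum over columns of sum_i f_i * ln(f_i / p_i) (natural log assumed)."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for letter, count in Counter(column).items():
            f = count / n
            total += f * math.log(f / background[letter])
    return total


def words(seq, width):
    """All substrings of the given width, left to right."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def greedy_align(sequences, width, background, keep=50):
    """Grow alignments one sequence at a time, keeping the best `keep` per cycle."""
    # Seed with every word from every sequence as a one-site alignment.
    beam = [((k,), (w,)) for k, seq in enumerate(sequences) for w in words(seq, width)]
    for _ in range(len(sequences) - 1):
        candidates = {}
        for used, sites in beam:
            for k, seq in enumerate(sequences):
                if k in used:
                    continue  # each sequence contributes at most one site
                for w in words(seq, width):
                    # Canonical order by sequence index so duplicates merge.
                    key = tuple(sorted(zip(used + (k,), sites + (w,))))
                    if key not in candidates:
                        candidates[key] = information_content(
                            [s for _, s in key], background)
        best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:keep]
        beam = [(tuple(i for i, _ in key), tuple(s for _, s in key)) for key, _ in best]
    _, sites = beam[0]
    return list(sites), information_content(list(sites), background)


if __name__ == "__main__":
    bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    seqs = ["TTGACAGG", "CCTGACAA", "GGATGACA"]
    print(greedy_align(seqs, 5, bg))
```

Because only the top `keep` partial alignments survive each cycle, the search is a heuristic and can miss the alignment with the highest information content. In the toy input, `TGACA` is the only 5-mer shared by all three sequences, so the search recovers it.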


1999 | Gerald Z. Hertz and Gary D. Stormo