This paper reviews over a decade of research on applying stochastic models to biological sequence analysis, focusing on motifs and domains in bioinformatics. The models themselves date back more than 30 years and serve two purposes: summarizing what is known about a motif or domain, and discovering new instances of it in sequences. The author introduces motif models in stages, from simple, non-stochastic versions to modern profile Hidden Markov Models (HMMs). A second example is gene finding using sequence data from one or two species, where generalized HMMs or generalized pair HMMs have proven effective.
The paper discusses the use of statistics in studying linear sequences of biomolecular units, emphasizing both descriptive and predictive aspects. It highlights the importance of the motif, domain, and site concepts in biological sequence analysis, noting that these elements are biochemically significant as well. The paper covers deterministic models, regular expressions, and sequence logos, which are useful for characterizing motifs. It then introduces profiles, which describe a motif by one position-specific distribution per motif position, and explains how a profile can be used to score query sequences and so identify instances of the motif.
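The profile scoring idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's own procedure: the function names (`build_profile`, `log_odds_score`, `best_hit`), the pseudocount of 1, the uniform background, and the toy DNA sites are all hypothetical choices made for the example.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA motif; a protein profile would use 20 residues


def build_profile(aligned_sites, pseudocount=1.0):
    """Estimate one distribution over ALPHABET per motif position.

    aligned_sites: equal-length, gap-free example instances of the motif.
    The pseudocount (an assumed smoothing choice) avoids zero probabilities.
    """
    length = len(aligned_sites[0])
    profile = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in ALPHABET})
    return profile


def log_odds_score(profile, window, background):
    """Sum over positions of log2(profile probability / background probability)."""
    return sum(
        math.log2(column[symbol] / background[symbol])
        for column, symbol in zip(profile, window)
    )


def best_hit(profile, query, background):
    """Slide the profile along the query; return (score, start) of the best window."""
    width = len(profile)
    scored = [
        (log_odds_score(profile, query[i:i + width], background), i)
        for i in range(len(query) - width + 1)
    ]
    return max(scored)


# Toy data, invented for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
uniform_background = {base: 0.25 for base in ALPHABET}
toy_profile = build_profile(sites)
score, start = best_hit(toy_profile, "GGCGTATAATGCGC", uniform_background)
```

Here `start` points at the window of the query that looks most like the motif relative to background. In practice a threshold on the score, rather than the single best window, would decide which windows count as instances.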
The paper then turns to Hidden Markov Models (HMMs), processes in which an unobserved (hidden) state sequence evolves as a Markov chain and the observation at each time step depends probabilistically on the hidden state at that step. HMMs have been widely adopted in genetics and molecular biology because likelihoods under them can be computed with elegant dynamic programming algorithms. Profile HMMs, introduced by A. Krogh and others, have become the standard approach for representing motifs and protein domains, offering more powerful search capabilities than older methods.
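As a concrete instance of the dynamic programming mentioned above, the following is a minimal sketch of the standard forward algorithm, which computes the likelihood of an observed sequence by summing over all hidden state paths. The two-state model (`AT_rich`, `GC_rich`) and all of its probabilities are invented for illustration and do not come from the paper. The sketch multiplies raw probabilities, so it is only suitable for short sequences; longer ones would need scaling or log-space arithmetic.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) under the HMM, summed over all hidden state paths.

    alpha[s] holds P(observations so far, current hidden state = s).
    """
    if not obs:
        return 1.0  # the empty sequence has probability 1 by convention
    # Initialisation: start in state s and emit the first symbol.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recursion: move from any previous state r to s, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    # Termination: sum over the final hidden state.
    return sum(alpha.values())


# Hypothetical toy model: two hidden compositional regimes emitting DNA bases.
STATES = ("AT_rich", "GC_rich")
START_P = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS_P = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT_P = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

likelihood = forward_likelihood("ATGC", STATES, START_P, TRANS_P, EMIT_P)
```

The loop costs time proportional to the sequence length times the square of the number of states, whereas summing over every hidden path explicitly would grow exponentially with sequence length.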
The paper also discusses the use of generalized HMMs (GHMMs) for finding genes in DNA sequences, emphasizing the importance of understanding the biology and the challenges in implementing and evaluating algorithms. Finally, it highlights the value of comparative sequence analysis using HMMs and the importance of collaboration with biologists in advancing the field.