[slides] Biological sequence analysis

This talk reviews over a decade of research on applying stochastic models to biological sequence analysis. These models, with a history of over 30 years, are used to summarize information about motifs or domains in bioinformatics and to identify instances of these in separate sequences. The talk introduces motif models starting from simple, non-stochastic versions, progressing to modern profile HMMs. It also discusses gene finding using generalized HMMs or generalized pair HMMs.

DNA, RNA, and proteins are polymers composed of smaller units. The sequence of these units determines their chemical properties. Statistical methods are used to study these sequences, either descriptively or predictively. While statistical models are useful, their underlying mechanisms should not be taken literally. Models can fail without notice, and biological confirmation of predictions is often necessary.

Biological sequences can be analyzed globally or locally. For example, genome base composition varies between species, and specific sequences like ATG are common motifs. Statistics help characterize sequences and identify them against a background of other sequences.

Deterministic models, such as regular expressions, are used to describe motifs. However, they can have false positives and negatives. Regular expressions are limited in capturing sequence variability, so position-specific distributions are used instead, represented by sequence logos. Profiles are sets of position-specific distributions describing motifs. Profile scores are log-likelihood ratios comparing a motif model to a background model. These scores can be modified to account for evolutionary patterns.

Profile HMMs are used to model motifs and domains, incorporating insertions and deletions. Hidden Markov Models (HMMs) are used in sequence analysis, allowing for probabilistic modeling of sequences. Profile HMMs are a type of HMM used for motifs and domains.
They have become the standard approach in bioinformatics, with databases like Pfam containing thousands of HMMs. Gene finding in DNA sequences is challenging, requiring computational methods due to the large amount of genomic data. Generalized HMMs (GHMMs) are effective for gene prediction, modeling features like exons, introns, and reading frames. Comparative sequence analysis using HMMs incorporates evolutionary conservation. Pair HMMs and generalized pair HMMs (GPHMMs) are used for aligning and finding genes in homologous sequences. Challenges in biological sequence analysis include understanding biology, designing models, and implementing algorithms. HMMs have been successful in applying mathematics to bioinformatics, and their use is widely recognized. The talk acknowledges the contributions of others and highlights the importance of HMMs in this field.
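
As an illustration of the step from regular expressions to profiles described above, the following is a minimal Python sketch, not code from the talk. It estimates position-specific distributions from a handful of aligned motif instances and scores a sequence window by its log-likelihood ratio against a background model. The motif instances, the uniform background, the pseudocount of 1, the threshold, and all function names are assumptions made up for this example, and sequences are assumed to contain only A, C, G, and T.

```python
import math
import re
from collections import Counter

ALPHABET = "ACGT"


def build_profile(instances, pseudocount=1.0):
    """Estimate one distribution over ALPHABET per motif position."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("motif instances must all have the same length")
    total = len(instances) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in instances)
        # Pseudocounts keep unseen letters from getting probability zero.
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def log_odds(window, profile, background):
    """Log2 likelihood ratio of the motif model versus the background model."""
    return sum(
        math.log2(column[ch] / background[ch])
        for ch, column in zip(window, profile)
    )


def scan(sequence, profile, background, threshold):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = log_odds(sequence[start:start + width], profile, background)
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy data, invented for illustration only.
instances = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
background = {a: 0.25 for a in ALPHABET}
profile = build_profile(instances)

# A deterministic pattern either matches or it does not; this one misses
# the known instance "TACAAT" (a false negative).
pattern = re.compile("TAT[AG]AT")
regex_misses_instance = pattern.fullmatch("TACAAT") is None

# The profile instead gives every window a graded score.
hits = scan("GGGTATAATGGG", profile, background, threshold=3.0)
```

With these toy inputs the regular expression rejects one of the instances it was meant to describe, whereas the profile still scores that instance well above a window of background-like sequence. The threshold trades false positives against false negatives and would need to be chosen for real data.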

Biological Sequence Analysis

2003 | T. P. Speed