2004 | Ewan Birney, Michele Clamp, and Richard Durbin
GeneWise and Genomewise are two algorithms developed by Ewan Birney, Michele Clamp, and Richard Durbin. Both are used by the Ensembl annotation system to predict gene structures. GeneWise uses homologous protein sequences to predict gene structures, while Genomewise uses cDNA and EST data. Both algorithms are highly accurate and can produce complete gene structures when given correct evidence.
The Ensembl gene prediction pipeline bases its gene predictions on experimental evidence of mature mRNA structures. It uses two types of evidence: direct placement of cDNA and EST sequences on the genome, and evidence from related genes in other species. The pipeline first collects evidence for a transcript and then constructs a valid transcript structure from it. After this, the final gene builder rationalizes the cDNA and EST data to form final genes.
GeneWise is a mature tool, with implementations available since 1997, and has been used in numerous genome projects. Several authors have assessed it, but no paper has described its theory in detail. Genomewise is a newer method developed specifically for the Ensembl pipeline, and it may also be useful outside that context.
Both algorithms were implemented in Dynamite, a language for dynamic programming that provides a higher-level way to specify HMMs and dynamic programming recursions. Working at this level yields efficient code with fewer opportunities for bugs, and it makes new variations of an algorithm quick to develop.
GeneWise solves the problem of comparing a protein sequence or protein HMM directly to genomic DNA, taking into account the statistical properties of gene structures and sequencing errors. It uses pair-HMMs to represent both the alignment of two protein sequences and the prediction of a protein-coding gene structure. The algorithm merges these two HMMs into a single HMM, and the same procedure gives a combined algorithm for any HMM model of gene structure and any HMM model of alignment.
The GeneWise model integrates two separate models: a gene prediction model and a protein homology model. The genomic sequence is equivalent to the A sequence, the predicted protein sequence is the B sequence, and the homologous protein sequence is the C sequence. The aim is to compare the genomic sequence directly to the homologous protein sequence considering all possible intermediates of the predicted protein.
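One way to write this aim as a formula is sketched below. It is an illustration under stated assumptions, not a formula taken from the text above: it assumes the gene prediction model supplies P(A | B), the homology model supplies P(B, C), and the genomic sequence A depends on the homologous protein C only through the predicted protein B.

```latex
P(A, C) \;=\; \sum_{B} P(A \mid B)\, P(B, C)
```

Under the same assumptions, an algorithm that reports a single best alignment would replace the sum over B with a maximum.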
The protein model is a probabilistic Smith-Waterman model and is 0th order, meaning that each emission depends only on the current state and not on previously emitted residues. The gene prediction model is generally of a higher order, so simplifications are made to make the merging process achievable. The model ignores amino acids split by introns and shortens high-order Markov dependencies in the coding sequence model.
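To make the idea of a 0th-order probabilistic Smith-Waterman model concrete, here is a minimal sketch of a three-state pair HMM (match, insert, delete) scored by local Viterbi in log-odds space. It is not the GeneWise implementation. The function names and all probabilities (`P_IDENTICAL`, `DELTA`, `EPSILON`, and a uniform background) are assumptions chosen only for illustration.

```python
import math

NEG_INF = float("-inf")
ALPHABET_SIZE = 20   # amino acids, with a uniform background (assumption)
P_IDENTICAL = 0.6    # assumed probability that the match state emits an identical pair
DELTA = 0.05         # assumed gap-open transition probability
EPSILON = 0.4        # assumed gap-extend transition probability


def match_log_odds(a, b):
    """Log-odds of the match state emitting (a, b) versus the background model.

    0th order: the score depends only on the current pair of residues.
    """
    background = 1.0 / ALPHABET_SIZE
    if a == b:
        joint = P_IDENTICAL / ALPHABET_SIZE
    else:
        joint = (1.0 - P_IDENTICAL) / (ALPHABET_SIZE * (ALPHABET_SIZE - 1))
    return math.log(joint / (background * background))


def local_viterbi(x, y):
    """Best local alignment log-odds score of x and y under the toy pair HMM."""
    n, m = len(x), len(y)
    vm = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # match state
    vi = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # residue of x against a gap
    vd = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # residue of y against a gap
    log_stay_match = math.log(1.0 - 2.0 * DELTA)
    log_close_gap = math.log(1.0 - EPSILON)
    log_open, log_extend = math.log(DELTA), math.log(EPSILON)
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The 0.0 option starts a new local alignment at this cell.
            prev = max(
                0.0,
                vm[i - 1][j - 1] + log_stay_match,
                vi[i - 1][j - 1] + log_close_gap,
                vd[i - 1][j - 1] + log_close_gap,
            )
            vm[i][j] = match_log_odds(x[i - 1], y[j - 1]) + prev
            # Gap states emit at background frequency, so their emission log-odds is 0.
            vi[i][j] = max(vm[i - 1][j] + log_open, vi[i - 1][j] + log_extend)
            vd[i][j] = max(vm[i][j - 1] + log_open, vd[i][j - 1] + log_extend)
            best = max(best, vm[i][j])
    return best
```

Calling `local_viterbi("ACDEFG", "ACDEFG")` gives a positive score, while two sequences with no residues in common score 0.0 because the empty local alignment is always available. Gap penalties are not chosen by hand here: they fall out of the transition probabilities, which is what distinguishes the probabilistic formulation from classical Smith-Waterman.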
The combined model has 10 × 3 = 30 states, because each of the 3 homology model states expands into 10 separate gene-finding states. However, not all of these states are actually required in the comparison, as some transitions are forced to zero. The pruned model, called GeneWise 21:93, is shown in Figure 2.
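The state-space construction described above can be sketched as a Cartesian product followed by pruning. This is a toy illustration only. The gene-finding state names are placeholders, the transition table is invented, and pruning by reachability over nonzero transitions is an assumed stand-in for however the actual model was reduced.

```python
from collections import deque
from itertools import product

HOMOLOGY_STATES = ("match", "insert", "delete")
# Placeholder names: the real gene-finding states are not listed in the text.
GENE_STATES = tuple(f"gene_state_{k}" for k in range(10))


def merge_states(homology_states, gene_states):
    """Pair every homology state with every gene-finding state."""
    return list(product(homology_states, gene_states))


def prune_unreachable(start, transitions):
    """Return the states reachable from start through transitions with probability > 0."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt, prob in transitions.get(state, {}).items():
            if prob > 0.0 and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


merged = merge_states(HOMOLOGY_STATES, GENE_STATES)

# Invented toy transitions over three of the merged states; merged[2] can only
# be entered through a transition whose probability is forced to zero.
toy_transitions = {
    merged[0]: {merged[0]: 0.5, merged[1]: 0.5, merged[2]: 0.0},
    merged[1]: {merged[0]: 1.0},
    merged[2]: {merged[0]: 1.0},
}
kept = prune_unreachable(merged[0], toy_transitions)
```

Here `len(merged)` is 30, matching the 10 × 3 count, and `kept` contains only the two states that remain reachable once the zero-probability transition is ignored.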
The GeneWise model was used to predict gene structures with high accuracy.