May 14, 2004 | W. H. Majoros*, M. Pertea and S. L. Salzberg
Two open-source ab initio eukaryotic gene finders, TigrScan and GlimmerHMM, are described. Both are based on generalized hidden Markov models (GHMMs) and have modular, extensible architectures. Both are retrainable and reconfigurable, so users can combine various probabilistic submodels, such as Maximal Dependence Decomposition trees and interpolated Markov models. Both programs have been used for genome annotation at TIGR, including annotation of the Aspergillus fumigatus and Toxoplasma gondii genomes.
TigrScan uses weight matrices and Markov chains, while GlimmerHMM incorporates splice site models and decision trees. Both programs use interpolated Markov models and Maximal Dependence Decomposition to improve splice site identification. TigrScan's model includes states for introns, intergenic regions, 5' and 3' untranslated regions, and four types of exons; GlimmerHMM's model includes states for exons, introns, and intergenic regions. TigrScan also provides a graph-theoretic representation of high-scoring open reading frames and can read and score gene models in GFF format.
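As an illustration of the simplest submodel named above, the following is a minimal sketch of a weight matrix used to score candidate splice sites. It is not taken from TigrScan or GlimmerHMM: the function names, the toy donor-site training windows, the uniform background, and the pseudocount are all assumptions chosen for the example.

```python
import math

ALPHABET = "ACGT"

def train_weight_matrix(sites, pseudocount=1.0):
    """Build per-position log-odds scores against a uniform background.

    `sites` is a list of equal-length aligned windows (hypothetical data).
    """
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log((counts[base] / total) / 0.25) for base in ALPHABET}
        )
    return matrix

def score_site(matrix, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(matrix, window))

def best_site(matrix, sequence):
    """Return (score, start) of the highest-scoring window in `sequence`."""
    width = len(matrix)
    scored = [
        (score_site(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Toy training windows around a donor site (illustrative only, not real data).
toy_donor_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
donor_matrix = train_weight_matrix(toy_donor_sites)
best_score, best_start = best_site(donor_matrix, "CCCAGGTAAGTCCC")
```

A weight matrix treats positions as independent. Maximal Dependence Decomposition and Markov chains relax that assumption by conditioning on neighbouring bases, which is why the programs combine several submodel types.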
In tests, both programs performed well relative to Genscan+. TigrScan was most competitive on the Arabidopsis thaliana test set, while GlimmerHMM performed best on the Aspergillus fumigatus test set. These results highlight the value of retraining gene finders for specific organisms. Time and memory requirements increase linearly with input sequence length, although the two programs make different trade-offs between speed and space. TigrScan processed a 5.6 Mb contig in 5 minutes 32 seconds using 105 Mb of RAM on a 1.6 GHz Pentium IV, demonstrating that long sequences can be processed on machines with limited memory.
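The reported TigrScan run can be turned into rough per-base figures. The sketch below is simple arithmetic on the numbers quoted above, under two assumptions that the text does not state: that "105 Mb of RAM" means 105 megabytes of 10^6 bytes each, and that linear scaling has no fixed overhead, so time and memory are directly proportional to sequence length. The helper `extrapolate` is hypothetical.

```python
# Figures quoted in the text for the TigrScan run.
contig_bases = 5.6e6              # 5.6 Mb contig
elapsed_seconds = 5 * 60 + 32     # 5 minutes 32 seconds = 332 s
ram_megabytes = 105               # assumption: megabytes of 10**6 bytes

# Derived per-base rates.
bases_per_second = contig_bases / elapsed_seconds        # roughly 16,900 bases/s
bytes_per_base = ram_megabytes * 1e6 / contig_bases      # roughly 18.75 bytes/base

def extrapolate(length_bases):
    """Linearly scale the reported run to another sequence length.

    Assumes strict proportionality (no fixed overhead) on the same hardware.
    Returns (seconds, megabytes).
    """
    scale = length_bases / contig_bases
    return elapsed_seconds * scale, ram_megabytes * scale

# Example: a contig twice as long would need about twice the time and memory.
seconds_2x, megabytes_2x = extrapolate(2 * contig_bases)
```

These are back-of-the-envelope estimates only; actual requirements depend on the hardware, the trained model, and the sequence content.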
By offering these programs as open-source, the authors hope to facilitate more studies comparing the suitability of alternative gene-finding strategies. The work was supported by NIH and NSF grants.