A probabilistic model is introduced for predicting complete gene structures in human genomic DNA. The model incorporates transcriptional, translational, and splicing signals, as well as the characteristics of exons, introns, and intergenic regions. It accounts for differences in gene density and structure across distinct C+G compositional regions of the human genome. New models of donor and acceptor splice signals are described that capture potential dependencies between signal positions. The model is implemented in the GENSCAN program, which identifies complete exon/intron structures in genomic DNA. GENSCAN can predict multiple genes in a sequence, handle partial as well as complete genes, and predict consistent gene sets on either or both DNA strands. It is more accurate than existing methods, identifying 75-80% of exons exactly, and it provides reliability estimates for each predicted exon. Accuracy is consistent across sequences of differing C+G content and across vertebrate groups.
GENSCAN was tested on the Burset/Guigó set of 570 vertebrate multi-exon gene sequences, where it outperformed existing programs in sensitivity, specificity, and correlation coefficient. It also performed well on sequences of varying C+G content and across different vertebrate species. Tests on independent sets gave results similar to those on the Burset/Guigó set. GENSCAN was able to predict complex genes, including the human gastric (H+K+)-ATPase gene with 22 coding exons. It was also effective at predicting genes in long genomic contigs, such as the CD4 gene region of human chromosome 12p13.
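The accuracy measures named above can be made concrete with a short sketch. The summary does not spell out the formulas, so this assumes the conventions commonly used in gene-prediction benchmarks: sensitivity is the fraction of true coding nucleotides that are predicted as coding, "specificity" is the fraction of predicted coding nucleotides that are truly coding, and the correlation coefficient is the usual binary (Matthews-style) correlation. The function names and the 0/1 encoding are illustrative choices, not part of GENSCAN.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Compare two equal-length 0/1 annotations (1 = coding nucleotide)."""
    if len(actual) != len(predicted):
        raise ValueError("annotations must cover the same positions")
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    nan = float("nan")
    sn = tp / (tp + fn) if tp + fn else nan  # share of true coding bases found
    sp = tp / (tp + fp) if tp + fp else nan  # share of predicted coding bases that are real
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else nan
    return {"Sn": sn, "Sp": sp, "CC": cc}


def exact_exon_sensitivity(actual_exons, predicted_exons):
    """Fraction of annotated exons whose (start, end) pair is predicted exactly."""
    actual = set(actual_exons)
    if not actual:
        return float("nan")
    return len(actual & set(predicted_exons)) / len(actual)


if __name__ == "__main__":
    truth = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
    guess = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
    print(nucleotide_accuracy(truth, guess))
    print(exact_exon_sensitivity([(1, 3), (6, 7)], [(1, 2), (6, 7)]))
```

The exon-level function counts an exon as correct only when both boundaries match, which is the sense in which a figure such as "75-80% of exons identified exactly" would be read under these assumptions.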
The model takes a probabilistic approach to capturing the structural and compositional features of human genes, with distinct parameters for each C+G compositional region to reflect variations in gene density and structure. As implemented in GENSCAN, it can predict multiple genes in a sequence, handle partial as well as complete genes, and predict consistent gene sets on either or both DNA strands. It also introduces a novel method, Maximal Dependence Decomposition (MDD), for modeling functional signals in DNA in a way that captures dependencies between signal positions.
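To give a feel for how dependencies between signal positions might be detected, the sketch below scores each position of a set of aligned sites by how strongly a consensus match at that position is associated with the bases at every other position, using a Pearson chi-square statistic, and then splits the sites on the highest-scoring position. The summary does not describe the procedure, so this is an assumption-laden illustration of a single decomposition step, not GENSCAN's implementation; the function names, the toy data, and the absence of any stopping rule or significance threshold are all simplifications.

```python
from collections import Counter


def chi_square(pairs):
    """Pearson chi-square statistic for a list of (row, column) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    rows = Counter(r for r, _ in pairs)
    cols = Counter(c for _, c in pairs)
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n  # never zero: only observed values appear
            stat += (joint[(r, c)] - expected) ** 2 / expected
    return stat


def dependence_scores(sites, consensus):
    """For each position i, sum chi-square(match-at-i, base-at-j) over all j != i."""
    length = len(consensus)
    scores = []
    for i in range(length):
        total = 0.0
        for j in range(length):
            if j != i:
                total += chi_square([(s[i] == consensus[i], s[j]) for s in sites])
        scores.append(total)
    return scores


def split_on_most_dependent(sites, consensus):
    """Split sites by consensus match at the position with the highest score."""
    scores = dependence_scores(sites, consensus)
    i = max(range(len(scores)), key=scores.__getitem__)  # first position wins ties
    matching = [s for s in sites if s[i] == consensus[i]]
    others = [s for s in sites if s[i] != consensus[i]]
    return i, matching, others


if __name__ == "__main__":
    toy_sites = ["GAT", "GCT", "CAA", "CCA"]  # positions 0 and 2 vary together
    print(dependence_scores(toy_sites, "GAT"))
    print(split_on_most_dependent(toy_sites, "GAT"))
```

In a fuller treatment each subset would be examined again in the same way, and a separate position-specific model would be fitted to each final subset; those steps are omitted here.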
GENSCAN was also tested on other datasets, including the GeneParser test sets, where its performance was similar to that on the Burset/Guigó set. On a long human contig, the CD4 gene region, it successfully predicted several genes. BLASTP searches lent further support to these predictions, showing that some of the predicted genes were similar to known proteins. Performance was robust across sequences of differing C+G content and across vertebrate species.
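Checking robustness across C+G content amounts to grouping test sequences by their C+G percentage and reporting accuracy for each group. A minimal sketch of that grouping follows. The cut-off values are illustrative assumptions rather than figures taken from this summary, and the helper names are hypothetical.

```python
from bisect import bisect_right

# Illustrative bin boundaries in percent C+G; assumptions, not values from the text.
CUTOFFS = (43.0, 51.0, 57.0)


def cg_percent(seq):
    """Percentage of unambiguous bases (A, C, G, T) that are C or G."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * (seq.count("C") + seq.count("G")) / counted


def cg_group(seq, cutoffs=CUTOFFS):
    """Index of the compositional bin: 0 for the lowest C+G range, and so on."""
    return bisect_right(cutoffs, cg_percent(seq))


def bin_by_cg(sequences, cutoffs=CUTOFFS):
    """Map each bin index to the names of the sequences that fall in it."""
    bins = {group: [] for group in range(len(cutoffs) + 1)}
    for name, seq in sequences.items():
        bins[cg_group(seq, cutoffs)].append(name)
    return bins


if __name__ == "__main__":
    demo = {"at_rich": "AAATTTAT", "balanced": "ACGTACGT", "gc_rich": "GGCCGCGA"}
    print(bin_by_cg(demo))
```

The same bin index could equally be used to select a region-specific parameter set before prediction, which is the role C+G composition plays in the model described here.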
The model's initial and transition probabilities were derived from data on gene density and structure in the different C+G compositional regions, reflecting in particular the higher gene density of C+G-rich regions. The state transition probabilities were set from the frequencies observed in the learning set. The model's state length distributions were