(1997) 268, 78–94 | Chris Burge* and Samuel Karlin
The paper introduces a probabilistic model of gene structure in human genomic sequences that incorporates transcriptional, translational, and splicing signals together with the compositional features of exons, introns, and intergenic regions. The model accounts for differences in gene density and structure across distinct C + G compositional regions of the human genome. GENSCAN, a computer program based on this model, identifies complete exon/intron structures in genomic DNA. It can predict multiple genes in a sequence, handle partial as well as complete genes, and predict consistent sets of genes on both strands of DNA. The model also includes a novel method, Maximal Dependence Decomposition, for capturing dependencies between positions in donor and acceptor splice signals. Tested on standardized sets of human and vertebrate genes, GENSCAN is shown to be more accurate than existing methods, identifying 75 to 80% of exons exactly, and the paper demonstrates its use for predicting novel genes in long genomic contigs.
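The accuracy figure quoted above counts an exon as "identified exactly" when the prediction matches the annotation at both boundaries. The following is a minimal sketch of such an exon-level comparison, not GENSCAN's own evaluation code. It assumes exons are represented as (start, end) coordinate pairs; the function name `exon_level_accuracy` and the example coordinates are hypothetical.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for exact exon matches.

    Exons are (start, end) tuples. A predicted exon counts as correct
    only if both coordinates equal those of an annotated exon.
    Sensitivity = exact matches / annotated exons.
    Specificity = exact matches / predicted exons (the convention
    commonly used in gene-prediction benchmarks; assumed here).
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    exact = annotated_set & predicted_set
    sensitivity = len(exact) / len(annotated_set) if annotated_set else 0.0
    specificity = len(exact) / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


# Hypothetical coordinates, for illustration only.
annotated = [(100, 250), (400, 520), (700, 810), (950, 1100)]
predicted = [(100, 250), (400, 530), (700, 810), (950, 1100), (1300, 1400)]

sn, sp = exon_level_accuracy(annotated, predicted)
print(f"exon-level sensitivity: {sn:.2f}, specificity: {sp:.2f}")
```

In this example the second predicted exon overlaps the annotated one but misses its end coordinate by 10 bases, so it does not count as exact. Three of the four annotated exons are matched, and three of the five predictions are correct.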