Recognition of protein coding regions in DNA sequences

1982 | James W. Fickett
Fickett proposed a test, called TESTCODE, that distinguishes protein-coding sequences (PCS) from noncoding DNA using statistical differences between the two. The test is simple and objective. It was validated on 400,000 bases of sequence data, misclassifying 5% of regions and giving "No Opinion" in 20% of cases. Validation was carried out on two halves of the Los Alamos Sequence Library, with an error rate of 5%.

The test uses eight parameters derived from the distribution of bases across codon positions and from overall base content. These parameters differ clearly between coding and noncoding DNA: coding sequences have higher Position parameters and higher GC content.

The method is reliable for sequences longer than 200 bases and can predict new coding and noncoding regions in published sequences. Potential applications include identifying ORFs, checking sequence libraries, and discovering new proteins. The test is useful to both experimentalists and theorists, and it applies to regions that are fully coding or fully noncoding.

TESTCODE is not suited to pinpointing exact coding boundaries, but it can be combined with other methods. It is insensitive to phase and requires additional methods to resolve overlapping ORFs.

The method rests on universal differences between coding and noncoding DNA and has also been tested on synthetic sequences, which supported its reliability. The paper presents a detailed, objective method for recognizing coding sequences that others can use to develop and test further methods.
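The eight parameters can be illustrated with a short Python sketch. The summary above does not give the formulas, so this assumes the commonly cited definitions: for each base, the Position parameter is the largest of its three per-codon-position counts divided by the smallest count plus one, and the Content parameter is the base's overall fraction of the sequence. The function name `fickett_parameters` is hypothetical, and the step that turns these eight numbers into a coding, noncoding, or "No Opinion" verdict is not reproduced here.

```python
BASES = "ACGT"


def fickett_parameters(seq):
    """Return eight raw parameters (4 Position, 4 Content) for a DNA string.

    Assumed definitions (not stated in the summary):
      Position(b) = max(counts) / (min(counts) + 1), where counts are the
                    occurrences of base b at offsets 0, 1, 2 modulo 3
      Content(b)  = occurrences of b / sequence length
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")

    # counts[b][k] = occurrences of base b at indices k, k+3, k+6, ...
    counts = {b: [0, 0, 0] for b in BASES}
    for i, base in enumerate(seq):
        if base in counts:  # characters other than A, C, G, T are skipped
            counts[base][i % 3] += 1

    params = {}
    for b in BASES:
        c = counts[b]
        params[f"{b}_position"] = max(c) / (min(c) + 1)
        params[f"{b}_content"] = sum(c) / n
    return params


if __name__ == "__main__":
    # Toy input only; far shorter than the 200 bases the method needs.
    toy = "ATGATGATG"
    shifted = toy[1:] + toy[0]  # same sequence rotated by one base
    print(fickett_parameters(toy))
    print(fickett_parameters(shifted))
```

Under these assumed definitions, shifting the sequence by one base only permutes the three per-position counts, so the Position parameters do not change. This is one way to read the statement that the test is insensitive to phase.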