2012 | Jennifer Harrow, Adam Frankish, Jose M. Gonzalez, Electra Tapanari, Mark Diekhans, Felix Kokocinski, Bronwen L. Aken, Daniel Barrell, Amonida Zadissa, Stephen Searle, If Barnes, Alexandra Bignell, Veronika Boychenko, Toby Hunt, Mike Kay, Gaurab Mukherjee, Jeena Rajan, Gloria Despacio-Reyes, Gary Saunders, Charles Steward, Rachel Harte, Michael Lin, Cedric Howald, Andrea Tanzer, Thomas Derrien, Jacqueline Christ, Nathalie Walters, Suganthi Balasubramanian, Baikang Pei, Michael Tress, Jose Manuel Rodriguez, Jelte van Baren, Michael Brent, David Haussler, Manolis Kellis, Alfonso Valencia, Alexandre Reymond, Mark Gerstein, Roderic Guigo, and Tim J. Hubbard
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci, with 33,977 coding transcripts not represented in UCSC genes and RefSeq. The consortium has examined the completeness of the transcript annotation, finding that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3077 consist of two-exon models, indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers. The consortium has also developed a pseudogene ontology to associate various biological properties with pseudogenes and a new method to differentiate the level of support for transcript annotations. The GENCODE gene set has evolved substantially between releases 3c and 7, with a significant decrease in the number of protein-coding loci and an increase in the number of alternatively spliced transcripts. The consortium has also developed a new experimental validation pipeline to identify gene models with limited or lower-confidence transcriptional evidence.