Understanding GENCODE: The reference human genome annotation for The ENCODE Project

The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of annotated alternatively spliced transcripts has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA (lncRNA) loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of lncRNA loci publicly available, with the predominant transcript form consisting of two exons. The consortium examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to PeptideAtlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two-exon models, indicating that they are possibly unannotated lncRNA loci.

GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers. The GENCODE gene set combines manual gene annotation from the Human and Vertebrate Analysis and Annotation (HAVANA) group with automatic gene annotation from Ensembl, and it is updated with every Ensembl release. In the manual annotation process, transcripts are aligned to the genome and annotated using the genomic sequence as the reference. The automatic annotation process uses the Ensembl gene annotation pipeline and includes data from UniProt, RefSeq, and other sources. The GENCODE gene merge process combines the HAVANA and Ensembl annotations, with a new module called HavanaAdder used to produce the merged gene set.
The genes in the GENCODE reference gene set are classified into three levels based on their annotation type. Level 1 includes transcripts manually annotated and experimentally validated, Level 2 includes manually annotated transcripts, and Level 3 includes transcripts from Ensembl's automated annotation pipeline. The GENCODE gene set includes 9019 transcripts at Level 1, 118,657 transcripts at Level 2, and 33,699 transcripts at Level 3. The lncRNA data set in GENCODE 7 consists of 5058 lincRNA loci, 3214 antisense loci, 378 sense intronic loci, and 930 processed transcripts loci. The lncRNA data set is larger than other available lncRNA data sets and shows limited intersection with them.
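
The three annotation levels can be tallied directly from an annotation file. The following is a minimal sketch, not the consortium's own tooling. It assumes a GTF-style file whose ninth column carries a `level N;` attribute on transcript records. The sample records, the identifiers in them, and the helper names `count_transcripts_by_level` and `level_shares` are hypothetical and exist only for illustration. The `REPORTED` counts are the Level 1 to Level 3 transcript numbers quoted in the text.

```python
import re
from collections import Counter

# Transcript counts per level as quoted in the text (GENCODE 7).
REPORTED = {1: 9019, 2: 118657, 3: 33699}

# Assumption: the attribute column contains a key of the form `level 2;`.
LEVEL_RE = re.compile(r'\blevel "?(\d+)"?;')


def count_transcripts_by_level(gtf_lines):
    """Count transcript records per annotation level in GTF-style lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are tallied
        match = LEVEL_RE.search(fields[8])
        if match:
            counts[int(match.group(1))] += 1
    return counts


def level_shares(counts):
    """Return each level's fraction of the total transcript count."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def _record(source, feature, attributes):
    # Build one tab-separated record; coordinates are placeholders.
    return "\t".join(["chr1", source, feature, "100", "900", ".", "+", ".", attributes])


# Hypothetical sample records, not real GENCODE entries.
SAMPLE_GTF = [
    "##description: hypothetical sample",
    _record("HAVANA", "gene", 'gene_id "G1"; level 2;'),
    _record("HAVANA", "transcript", 'gene_id "G1"; transcript_id "T1"; level 1;'),
    _record("HAVANA", "transcript", 'gene_id "G1"; transcript_id "T2"; level 2;'),
    _record("HAVANA", "transcript", 'gene_id "G2"; transcript_id "T3"; level 2;'),
    _record("ENSEMBL", "transcript", 'gene_id "G3"; transcript_id "T4"; level 3;'),
]

if __name__ == "__main__":
    print(dict(count_transcripts_by_level(SAMPLE_GTF)))
    for level, share in sorted(level_shares(REPORTED).items()):
        print(f"Level {level}: {share:.1%} of {sum(REPORTED.values())} transcripts")
```

Applied to the quoted counts, `level_shares(REPORTED)` shows that manually annotated Level 2 transcripts make up the bulk of the set. To run the same tally on a real file, pass an open file handle in place of `SAMPLE_GTF`.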
