2013 | Tanya Barrett, Stephen E. Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F. Kim, Maxim Tomashevsky, Kimberly A. Marshall, Katherine H. Phillippy, Patti M. Sherman, Michelle Holko, Andrey Yefanov, HyeSeung Lee, Naigong Zhang, Cynthia L. Robertson, Nadezhda Serova, Sean Davis and Alexandra Soboleva
The Gene Expression Omnibus (GEO) is an international public repository for high-throughput microarray and next-generation sequencing (NGS) functional genomic data sets. It is maintained by the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, and provides free access to raw data, processed data, and metadata for all archived studies. GEO offers web-based tools for querying, analyzing, and visualizing data. The article reports on recent developments, including the release of GEO2R, an R-based web application that helps users analyze GEO data.
GEO archives data from over 13,000 laboratories, comprising more than 800,000 samples from over 1,600 organisms. The number of submitted series has increased significantly, with over 6,800 new series processed in 2011. Data types archived in GEO reflect evolving trends in functional genomics, with 'expression profiling by array' being the most common study type, although its growth rate is slowing. Next-generation sequence submission rates have been rapidly increasing since 2008, with methods like ChIP-seq now submitted at a higher frequency than their array-based counterparts.
GEO supports next-generation sequence data by providing guidelines and tools for data submission and analysis. It hosts processed data files along with sample and study metadata, while raw data files are linked to NCBI's Sequence Read Archive (SRA). Over 44 terabases of read data have been loaded into SRA, and thousands of processed data files have been incorporated into NCBI's Epigenomics database.
Recent updates to GEO enhance its search, navigation, and analysis tools: sample records are now indexed as a distinct entry type, sample characteristics are indexed under a new 'Attribute' field, and a 'similar studies' link has been added. The 'find pathways' feature allows users to map genes to pathways in NCBI's BioSystems database. The 'GEO repository browser' has been redesigned to include more auxiliary information and links to related records.
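The sample-level indexing described above lends itself to scripted queries. The following is a minimal sketch, assuming GEO records are searched through NCBI's E-utilities `esearch` endpoint with `db=gds`, and assuming the `[Entry Type]` and `[Attribute]` field tags are accepted there as in the web interface. The helper name `build_geo_query_url` and the example attribute value `smoker` are hypothetical and chosen only for illustration. The code builds the query URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_query_url(attribute_value, entry_type="gsm", retmax=20):
    """Build an esearch URL restricted to one GEO entry type and one attribute value.

    Assumptions (not confirmed by the summary above): the field tags
    [Entry Type] and [Attribute] are valid in db=gds, and "gsm" selects
    sample records.
    """
    term = f"{entry_type}[Entry Type] AND {attribute_value}[Attribute]"
    params = {"db": "gds", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query: sample records whose characteristics mention "smoker".
print(build_geo_query_url("smoker"))
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is sent with any HTTP client.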
GEO2R is a major update that allows users to perform sophisticated R-based analysis of GEO data to identify and visualize differentially expressed genes. It uses established Bioconductor R packages to transform and analyze data, presenting results as a table of genes ordered by significance and visualized with GEO Profile graphics. GEO2R does not rely on curated DataSet records and interrogates original submitter-supplied data directly, allowing over 90% of GEO studies to be analyzed this way.
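To make the idea of a gene table ordered by significance concrete, here is a simplified Python sketch. It is not GEO2R's method: GEO2R runs Bioconductor packages in R, whereas this stand-in log2-transforms the values and ranks genes by the absolute Welch t-statistic between two sample groups, without computing p-values or multiple-testing corrections. The gene names, expression values, and function names are hypothetical, and the input is assumed to hold positive, non-log-transformed intensities.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t-statistic for group_b versus group_a (unequal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        return 0.0  # no variation in either group: treat as uninformative
    return (mean(group_b) - mean(group_a)) / se


def rank_genes(expression, control_cols, treated_cols):
    """Return (gene, t_statistic, log2_fold_change) rows, strongest signal first."""
    rows = []
    for gene, values in expression.items():
        logged = [math.log2(v) for v in values]  # assumes positive raw values
        control = [logged[i] for i in control_cols]
        treated = [logged[i] for i in treated_cols]
        log_fc = mean(treated) - mean(control)
        rows.append((gene, welch_t(control, treated), log_fc))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical data: three control samples followed by three treated samples.
expression = {
    "GENE_A": [100, 110, 95, 400, 420, 390],
    "GENE_B": [50, 52, 49, 51, 50, 53],
    "GENE_C": [200, 180, 220, 150, 160, 140],
}
ranked = rank_genes(expression, control_cols=[0, 1, 2], treated_cols=[3, 4, 5])
for gene, t_stat, log_fc in ranked:
    print(f"{gene}\tt={t_stat:.2f}\tlog2FC={log_fc:.2f}")
```

In this toy input the gene with the largest, most consistent shift between groups sorts to the top, which mirrors how a ranked results table directs attention to the strongest candidates first.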
GEO data are widely reused by the research community for various purposes, including supporting hypotheses, testing algorithms, identifying disease predictors, and developing value-added databases. The re-use rate is increasing, with more scientists using a data-driven approach to research. Ongoing challenges include expanding integration with related resources, procuring consistent sample annotations, and providing additional methods for analyzing next-generation sequence data.