Big Data: Astronomical or Genomical?

July 7, 2015 | Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, Gene E. Robinson
Genomics is a Big Data science that is expected to grow significantly in the coming years. The authors compare genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. They find that genomics is a "four-headed beast" and is either on par with or the most demanding of the domains analyzed in terms of data acquisition, storage, distribution, and analysis. They discuss the need for new technologies to meet the computational challenges that genomics poses for the near future.

Genomics data acquisition is highly distributed and involves heterogeneous formats. The rate of growth over the last decade has been astonishing, with the total amount of sequence data produced doubling approximately every seven months. The number of sequencing instruments is also increasing, with over 2,500 high-throughput instruments located in nearly 1,000 sequencing centers in 55 countries. The raw sequencing reads used in most published studies are archived at the Sequence Read Archive (SRA), which currently contains over 3.6 petabases of raw sequence data.

The storage requirements for all four domains are projected to be enormous. For genomics, more than 100 petabytes of storage are currently used by only 20 of the largest institutions. The storage needs for human genomes are expected to be as high as 2–40 exabytes by 2025. Effective data compression can reduce these needs, although decompression time and fidelity are major concerns in compressive genomics.

The distribution patterns of genomics data are much more heterogeneous than those of the other three domains, combining elements of the distribution patterns seen in them. For large-scale analysis, cloud computing is particularly suited to decreasing the bandwidth needed to distribute genomic data. However, new methods of data reliability and security are required to ensure privacy, much more so than for the other three domains.

Computational requirements for data analysis are where the four domains differ most. Astronomy data require extensive specialized analysis, but the bulk of this requirement is for in situ processing and reduction of data by computers located near the telescopes. YouTube videos are primarily meant to be viewed, along with some automated analysis for advertisements or copyright infringements. Twitter data are the subject of intense research in the social sciences, especially for topic and sentiment mining. Analysis of genomic data involves a more diverse range of approaches because of the variety of steps involved in reading a genome sequence and deriving useful information from it. For population and medical genomics, identifying the genomic variants in each individual genome is currently one of the most computationally complex phases. Whole genome alignment is another important form of genomic data analysis, used for a variety of goals, from phylogeny reconstruction to genome annotation via comparative methodologies.

The authors conclude that, of the four Big Data domains, genomics poses the greatest challenges for data acquisition. They discuss several key technological needs for Big Data genomics, including advances in sequencing technologies, fast and tiered storage systems, cloud-com
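
To give a feel for what "doubling approximately every seven months" implies, the short Python sketch below converts that doubling time into a yearly growth factor and into the time needed for a thousand-fold increase. It assumes the doubling time stays constant, which the summary does not claim for the future; the function names are illustrative only and the figures are rough arithmetic, not projections from the paper.

```python
import math

# Doubling time quoted in the summary (approximate, in months).
DOUBLING_MONTHS = 7.0


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth over `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_months)


def months_to_grow(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed to grow by `factor`, assuming a constant doubling time."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    print(f"Growth over one year: x{growth_factor(12):.2f}")
    print(f"Months for 1000-fold growth: {months_to_grow(1000):.1f}")
```

Under that constant-rate assumption, a seven-month doubling time works out to a bit more than a threefold increase per year, and a thousand-fold increase in a little under six years.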