October 2007 | Samuel Levy, Granger Sutton, Pauline C. Ng, Lars Feuk, Aaron L. Halpern, Brian P. Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F. Kirkness, Gennady Denisov, Yuan Lin, Jeffrey R. MacDonald, Andy Wing Chun Pang, Mary Shago, Timothy B. Stockwell, Alexia Tsiamouri, Vikas Bansal, Saul A. Kravitz, Dana A. Busam, Karen Y. Beeson, Tina C. McIntosh, Karin A. Remington, Jose F. Abril, John Gill, Jon Borman, Yu-Hui Rogers, Marvin W. Frazier, Stephen W. Scherer, Robert L. Strausberg, J. Craig Venter
The diploid genome sequence of an individual human was generated using Sanger dideoxy technology and assembled into 4,528 scaffolds comprising 2,810 million bases (Mb) of contiguous sequence at approximately 7.5-fold coverage. Comparison with the National Center for Biotechnology Information (NCBI) human reference assembly revealed more than 4.1 million DNA variants, including 3.2 million single nucleotide polymorphisms (SNPs), 53,823 block substitutions, 292,102 heterozygous insertion/deletion (indel) events, 559,473 homozygous indels, 90 inversions, and numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events but 74% of all variant bases, which highlights the importance of non-SNP genetic alterations in defining the diploid genome structure. In addition, 44% of genes were heterozygous for one or more variants. A novel haplotype assembly strategy spanned 1.5 Gb of genome sequence in segments >200 kb, adding precision to the description of the genome's diploid nature. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.

Of the 4.1 million variants, 30% were novel; they include SNPs, indels, inversions, segmental duplications, and more complex forms of DNA variation. The variants were used to build long-range haplotypes covering 11,250 genes (58% of all genes). The study identified 263,923 heterozygous indels spanning 635,314 bp and ranging from 1 to 321 bp in length. The characteristics of the detected indels, namely their size distribution below 5 bp and the inverse relationship between the number of indels and their length, are similar to previous observations. The study also identified 559,473 homozygous indels spanning 5.9 Mb and ranging from 1 to 82,771 bp in length.
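The proportions quoted above can be checked with simple arithmetic. The following is a minimal sketch in Python, not part of the study's own analysis. It assumes the rounded SNP count of 3.2 million, leaves out segmental duplications and copy number variation regions (the text gives no counts for them), and uses each indel figure exactly as quoted in its own sentence. All variable names are illustrative.

```python
# Variant counts as quoted in the text. The SNP count is the rounded
# "3.2 million"; segmental duplications and copy number variation regions
# are omitted because no counts are given for them (an assumption).
counts = {
    "snp": 3_200_000,
    "block_substitution": 53_823,
    "heterozygous_indel": 292_102,
    "homozygous_indel": 559_473,
    "inversion": 90,
}

total_events = sum(counts.values())
non_snp_events = total_events - counts["snp"]
non_snp_fraction = non_snp_events / total_events

# Mean indel length = bases spanned / number of events, using the
# figures from the sentences that report spans (263,923 events over
# 635,314 bp; 559,473 events over roughly 5.9 Mb).
het_mean_bp = 635_314 / 263_923
hom_mean_bp = 5_900_000 / 559_473

print(f"total events counted: {total_events:,}")
print(f"non-SNP share of events: {non_snp_fraction:.1%}")
print(f"mean heterozygous indel length: {het_mean_bp:.2f} bp")
print(f"mean homozygous indel length: {hom_mean_bp:.2f} bp")
```

Under these assumptions the non-SNP share of events comes out at about 22%, in line with the figure quoted above. The mean indel lengths are roughly 2.4 bp for heterozygous and 10.5 bp for homozygous indels, which suggests that a relatively small number of long homozygous events contribute a large share of the variant bases.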
The ratio of SNPs to indels is lower in the HuRef assembly than in the SeattleSNPs data, indicating that relatively fewer SNPs or relatively more indels were called. The more likely explanation is that relatively more indels were identified. The study also identified 6,535 homozygous indels of at least 100 bases for which both flanks can be located precisely on the HuRef and NCBI assemblies. These comprise 3,431 insertions