The article describes work toward high-quality, nearly error-free reference genomes for all vertebrate species. The international Genome 10K (G10K) consortium worked for five years to evaluate and develop cost-effective methods for assembling accurate and nearly complete reference genomes. The consortium generated assemblies for 16 species representing six major vertebrate lineages and confirmed that long-read sequencing technologies are essential for maximizing genome quality. Unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. The new assemblies correct substantial errors and add missing sequence in some of the best historical reference genomes. They also yield biological findings: the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements, and a canonical GC-rich pattern in protein-coding genes.
The Vertebrate Genomes Project (VGP) was initiated to generate high-quality, complete reference genomes for all approximately 70,000 extant vertebrate species. The G10K consortium first evaluated multiple genome sequencing and assembly approaches on one species, Anna's hummingbird, and then deployed the best-performing method across 16 species. Long-read sequencing technologies produced significantly longer contigs than short-read sequencing. After a function in the PacBio FALCON software was fixed, contig NG50 (the length at which contigs of that size or longer cover half of the estimated genome size) nearly tripled. These findings are consistent with theoretical predictions and demonstrate that short reads alone cannot achieve high contig continuity.
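Because contig NG50 is the continuity measure used throughout this summary, a small worked example may help. The sketch below is illustrative only: the function names and the toy contig lengths are assumptions made for this example and are not taken from the VGP pipeline or its data.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50 for a set of contigs.

    NG50 is the length L such that contigs of length >= L together cover
    at least half of the (estimated) genome size. Returns None when the
    contigs never reach half of the genome size.
    """
    half = genome_size / 2
    running_total = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return None


def n50(contig_lengths):
    """N50 is the same statistic with the assembly's own total as the target."""
    return ng50(contig_lengths, sum(contig_lengths))


# Toy lengths (arbitrary units), invented for illustration.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))        # 70: 80 + 70 = 150 covers half of 290
print(ng50(toy_contigs, 400))  # 50: 80 + 70 + 50 = 200 covers half of 400
```

Using the estimated genome size rather than the assembly total keeps the value comparable across assemblies of the same species, even when one assembly is missing sequence.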
The VGP assembly pipeline starts from haplotype-separated contigs built from PacBio continuous long reads (CLR), scaffolds them with linked reads, optical maps, and Hi-C, and finishes with gap filling, base-call polishing, and manual curation. The pipeline was tested on 15 additional species spanning all major vertebrate classes. The resulting assemblies achieved high contig NG50, high scaffold NG50, and an average base quality of Q40, or about one error per 10,000 bases. They also exposed false gene duplications, increases in gene sizes, and chromosome rearrangements.
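The Q40 figure is a Phred-scaled quality value, defined as Q = -10 log10(p), where p is the per-base error probability. The following minimal sketch shows the conversion in both directions. The function names are hypothetical and the error counts are made up for illustration; this is not how the VGP pipeline itself estimates base quality.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality value Q into a per-base error probability."""
    return 10 ** (-q / 10)


def errors_to_phred(error_count, total_bases):
    """Convert an observed error count over total_bases into a Phred value.

    Returns math.inf when no errors are observed.
    """
    if error_count == 0:
        return math.inf
    return -10 * math.log10(error_count / total_bases)


# Q40 corresponds to roughly one error in 10,000 bases.
print(phred_to_error_probability(40))
# Hypothetical counts: 1 error observed in 10,000 bases.
print(errors_to_phred(1, 10_000))
```

Each 10-point step in Q is a tenfold change in error rate, so Q30 is about one error per 1,000 bases and Q50 about one per 100,000.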
Repeats strongly affect assembly continuity. The G10K consortium found that contig NG50 decreased exponentially with increasing repeat content. After scaffolding and gap filling, the number of remaining gaps correlated significantly and positively with repeat content. The VGP assemblies also contained false duplications, which were identified and removed during curation using read coverage, k-mer profiles, and self-, transcript-, optical map, and Hi-C alignments.
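An exponential decrease of contig NG50 with repeat content can be written as NG50 ≈ a · exp(-b · r), where r is the repeat content. One simple way to estimate a and b is ordinary least squares on the logarithm of NG50. The sketch below assumes that approach; the source does not state how the fit was performed, the function name is hypothetical, and the data points are synthetic values generated from a known curve, not measurements from the study.

```python
import math


def fit_exponential_decay(xs, ys):
    """Fit y = a * exp(-b * x) by linear regression on ln(y).

    Returns (a, b). All ys must be positive.
    """
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log_y = sum(log_ys) / n
    # Standard least-squares slope and intercept for ln(y) = ln(a) - b * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys))
    slope = sxy / sxx
    intercept = mean_log_y - slope * mean_x
    return math.exp(intercept), -slope


# SYNTHETIC data: repeat content (%) and contig NG50 (Mb) drawn from an
# exact curve with a = 20 and b = 0.05, used only to exercise the fit.
repeat_percent = [10, 20, 30, 40, 50]
contig_ng50_mb = [20 * math.exp(-0.05 * r) for r in repeat_percent]

a, b = fit_exponential_decay(repeat_percent, contig_ng50_mb)
print(a, b)
```

Fitting in log space weights relative rather than absolute deviations, which suits a quantity such as NG50 that spans orders of magnitude.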
Curation is essential for completing high-quality reference assemblies. The G10K consortium found that automated scaffolding methods introduced tens to thousands of unique joins and breaks in contigs or scaffolds. Manual curation resulted in additional interventions for 19 genome assemblies. The VGP assemblies also shed light on chromosome evolution, including chromosome rearrangements, and enabled the identification of new regulatory sequences in GC-rich promoter regions.
The VGP assemblies also revealed biological discoveries, including the first Bat1K study, which generated a genome-scale phylogeny that better