TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline

February 28, 2014 | Jeffrey C. Glaubitz, Terry M. Casstevens, Fei Lu, James Harriman, Robert J. Elshire, Qi Sun, Edward S. Buckler
TASSEL-GBS is a bioinformatics pipeline for efficiently processing raw genotyping by sequencing (GBS) data into SNP genotypes. It is suited to high-throughput genotyping of large numbers of individuals at many SNP markers, and it scales from small studies to extremely large ones of up to 100,000 individuals while running on modest computing resources, such as desktop or laptop machines with 8–16 GB of RAM. It is applicable in accelerated breeding contexts, where rapid turnaround from tissue collection to genotypes is required. The pipeline can use either a reference genome or an unfinished "pseudo-reference" consisting of numerous contigs. It was benchmarked on a large-scale analysis of maize (Zea mays), where population genetic-based SNP filters reduced the average error rate to 0.0042. Together, the GBS assay and the TASSEL-GBS pipeline provide robust tools for studying genomic diversity.
The pipeline is implemented in the Java program TASSEL (version 4) and is tailored to the GBS protocols of Elshire et al. or Poland et al. It supports 15 single restriction enzymes and 15 restriction enzyme pairs, new enzymes can be easily added, and various barcoding approaches can be handled. It is optimized for low sequencing depth (0.5 to 3×) over a large number of markers in a large sample of individuals. The pipeline can process very large data structures, such as the "TagsByTaxa" (TBT) object, which records the observed depth in each individual for each potentially useful sequence tag. It avoids redundant alignment of identical reads by first collapsing all reads into a master tag list. It also favors allelic redundancy over quality scores, using the number of times a given tag has been observed as an indicator of sequence quality.
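The read-collapsing and TagsByTaxa ideas described above can be illustrated with a small Python sketch. This is not the TASSEL implementation (which is written in Java); the function names, the input layout, and the `min_count` threshold are assumptions made for illustration only. It shows identical reads being merged into a master tag list, rarely observed tags being dropped on the basis of their count rather than quality scores, and a per-sample depth table being built for the retained tags.

```python
from collections import Counter

def collapse_reads(reads_by_taxon, min_count=2):
    """Collapse identical reads into a master tag list with total counts.

    reads_by_taxon: dict mapping a sample (taxon) name to a list of read strings.
    min_count: hypothetical threshold; tags observed fewer times in total are
    dropped, using observation count as a stand-in for sequence quality.
    """
    totals = Counter()
    for reads in reads_by_taxon.values():
        totals.update(reads)
    return {tag: n for tag, n in totals.items() if n >= min_count}

def tags_by_taxa(reads_by_taxon, master_tags):
    """Build a toy TagsByTaxa table: depth of each retained tag in each sample."""
    tbt = {tag: {taxon: 0 for taxon in reads_by_taxon} for tag in master_tags}
    for taxon, reads in reads_by_taxon.items():
        for read in reads:
            if read in tbt:
                tbt[read][taxon] += 1
    return tbt

# Toy input: sample names and reads are made up for illustration.
example_reads = {
    "sampleA": ["ACGT", "ACGT", "TTGA"],
    "sampleB": ["ACGT", "GGCC"],
}
master = collapse_reads(example_reads, min_count=2)
table = tags_by_taxa(example_reads, master)
```

Because each distinct tag appears once in the master list, a downstream aligner would only need to see it once, no matter how many reads carried it.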
Population genetic-based filtering removes error-prone putative SNPs, based on parameters such as minor allele frequency (MAF) and inbreeding coefficient (F_IT). The pipeline has the capacity for very large analyses involving tens of thousands of samples, yet can also be run at smaller scales. It processes data rapidly with a relatively modest memory footprint, so it can be run on desktop or laptop computers, which increases its usability for researchers in developing countries who may lack access to sophisticated computing resources. Separating SNP discovery and genotyping into two phases reduces potential ascertainment biases. These properties make the TASSEL-GBS pipeline highly suitable for genomics-assisted, accelerated breeding, where rapid turnaround from tissue collection to genotypes is essential.
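As a rough illustration of a filter on MAF and inbreeding coefficient, the sketch below computes both statistics for one biallelic SNP from genotype calls. It uses the textbook estimator F = 1 - H_obs / H_exp with H_exp = 2p(1 - p). Whether the pipeline computes F_IT in exactly this way is an assumption, and the function names and the `min_maf` and `min_f` thresholds are hypothetical values chosen only for the example.

```python
def maf_and_f(genotypes):
    """Return (MAF, F) for one biallelic SNP.

    genotypes: iterable of 0, 1 or 2 (copies of the alternate allele),
    with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return None, None          # no data at this site
    p = sum(called) / (2 * n)      # alternate allele frequency
    maf = min(p, 1 - p)
    h_exp = 2 * p * (1 - p)        # expected heterozygosity
    if h_exp == 0:
        return maf, None           # monomorphic site: F is undefined
    h_obs = sum(1 for g in called if g == 1) / n
    return maf, 1 - h_obs / h_exp

def passes_filter(genotypes, min_maf=0.01, min_f=0.8):
    """Keep a SNP only if both statistics meet the (hypothetical) thresholds."""
    maf, f = maf_and_f(genotypes)
    if maf is None or f is None:
        return False
    return maf >= min_maf and f >= min_f
```

In a sample of inbred lines, a genuine SNP is expected to show few heterozygous calls and therefore a high F, whereas a site where most individuals appear heterozygous yields a low or negative F and would be rejected by a filter of this kind.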
The high density of markers potentially available from the G