2017 | John Vivian, Arjun Arkal Rao, Frank Austin Nothaft, Christopher Ketchum, Joel Armstrong, Adam Novak, Jacob Pfeil, Jake Narkizian, Alden D Deran, Audrey Musselman-Brown, Hannes Schmidt, Peter Amstutz, Brian Craft, Mary Goldman, Kate Rosenbloom, Melissa Cline, Brian O'Connor, Megan Hanna, Chet Birger, W James Kent, David A Patterson, Anthony D Joseph, Jingchun Zhu, Sasha Zaranek, Gad Getz, David Haussler, Benedict Paten
Toil is an open-source, portable workflow software designed for large-scale biomedical data analysis in cloud and high-performance computing (HPC) environments. It enables reproducible, efficient processing of genomic datasets, which can span tens of thousands of samples and petabytes of sequencing data. To cope with that scale, Toil provides fault tolerance, support for both cloud and HPC environments, and performance features aimed at very large datasets. It supports the common workflow languages CWL and WDL and offers a Python API for declaring workflows statically or dynamically. Toil runs on AWS, Azure, Google Cloud, OpenStack, and HPC systems, and works with different job stores, such as S3 or a network file system.
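The difference between static and dynamic workflow declaration can be illustrated with a toy model. The sketch below is plain Python and does not use Toil's API: the `Job` class, `add_child`, `run`, and the sample names are hypothetical, chosen only to show that a static graph is fully built before execution while a dynamic one grows as jobs run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    """Toy job (not Toil's Job class): a function plus child jobs that run after it."""
    fn: Callable[["Job"], str]
    children: List["Job"] = field(default_factory=list)

    def add_child(self, fn: Callable[["Job"], str]) -> "Job":
        child = Job(fn)
        self.children.append(child)
        return child


def run(root: Job) -> List[str]:
    """Run jobs depth-first and return the order in which they executed."""
    log = []
    stack = [root]
    while stack:
        job = stack.pop()
        log.append(job.fn(job))
        # Children are read only after fn returns, so children that the job
        # added to itself while running are scheduled as well.
        stack.extend(reversed(job.children))
    return log


def make_align(sample: str) -> Callable[[Job], str]:
    """Build a hypothetical per-sample step."""
    def align(job: Job) -> str:
        return f"align {sample}"
    return align


def discover(job: Job) -> str:
    # Dynamic declaration: the fan-out is decided while the job runs,
    # for example after listing the samples that are actually present.
    for sample in ["s1", "s2", "s3"]:
        job.add_child(make_align(sample))
    return "discover"


# Static declaration: the whole graph exists before anything runs.
static_root = Job(lambda job: "start")
static_root.add_child(make_align("s1"))

# Dynamic declaration: only the root exists up front.
dynamic_root = Job(discover)
```

Calling `run(static_root)` executes a graph that was fixed in advance, whereas `run(dynamic_root)` executes three alignment jobs that did not exist until `discover` ran. Dynamic declaration is useful when the number of samples or shards is only known at run time.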
Toil's portability is achieved through pluggable backend APIs for machine provisioning, job scheduling, and file management. It includes performance optimizations, such as a leader/worker pattern for job scheduling, file caching, and data streaming to reduce I/O bottlenecks. Toil is robust to job failures and can utilize low-cost, preemptable machines, significantly reducing costs. For example, using AWS preemptable cores reduced RNA-seq processing costs by 2.5-fold.
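A rough way to picture how a leader can tolerate preempted workers is a queue in which any job that loses its machine is simply put back for another attempt. The following is a minimal simulation under stated assumptions, not Toil's scheduler: `run_with_retries`, `preempt_prob`, and `max_retries` are hypothetical names, and preemption is modeled as an independent random event per attempt.

```python
import random
from collections import deque
from typing import List, Tuple


def run_with_retries(
    jobs: List[str],
    preempt_prob: float = 0.3,
    max_retries: int = 5,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Simulate a leader that reissues jobs whose worker was preempted.

    Returns (completed, failed). A job fails only after it has been
    preempted on its first attempt and on every one of its retries.
    """
    rng = random.Random(seed)  # seeded so a run is repeatable
    queue = deque((name, 0) for name in jobs)  # (job name, retries used so far)
    completed, failed = [], []
    while queue:
        name, retries = queue.popleft()
        preempted = rng.random() < preempt_prob
        if not preempted:
            completed.append(name)
        elif retries < max_retries:
            # The leader puts the job back on the queue for another worker.
            queue.append((name, retries + 1))
        else:
            failed.append(name)
    return completed, failed
```

Because a preempted job costs only a repeat of that one job rather than the whole workflow, cheaper but less reliable machines can be used, at the price of some repeated work.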
Toil was demonstrated by processing over 20,000 RNA-seq samples from four studies, achieving a 30-fold reduction in cost and time compared to traditional methods. It supports workflows that can be run on private HPC clusters and is compatible with various data formats and tools. The software includes a repository of genomic workflows and integrates with Apache Spark for efficient processing. Toil also ensures data security through encryption and secure storage.
The study highlights Toil's ability to enable large-scale, reproducible, and cost-effective biomedical data analysis across diverse environments, supporting open-source standards and facilitating research in genomics and beyond.