SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

October 5, 2016 | Wei Shen, Shuai Le, Yan Li, Fuquan Hu
SeqKit is a cross-platform and ultrafast toolkit for manipulating FASTA/Q files. It provides executable binary files for Windows, Linux, and Mac OS X, and can be used directly without dependencies or configuration. It offers competitive performance in execution time and memory usage compared to similar tools. SeqKit is open source and available on GitHub.

The toolkit includes nineteen subcommands that provide completely independent functions for FASTA/Q manipulation. All subcommands support plain or gzip-compressed inputs and outputs from standard streams or local files. SeqKit uses a lightweight, high-performance bioinformatics package for FASTA/Q parsing, which achieves performance close to that of the widely used klib (kseq.h). It seamlessly supports both FASTA and FASTQ formats, and the file type is detected automatically.

SeqKit uses multiple CPUs to accelerate computationally intensive processes. It uses a custom data structure and algorithm for reverse complementary sequence computation, resulting in a ~20× speedup compared to the map strategy. Most subcommands do not load all FASTA/Q records into memory, which reduces memory usage. Some subcommands, such as "sample", "split", "shuffle", and "sort", can read files twice in two-pass mode, using the FASTA index for rapid access and reduced memory usage.

SeqKit provides more comprehensive features than other tools. It supports searching sequences by pattern, locating sequence motifs, and identifying common sequences between multiple files. It also provides practical extended positioning strategies for obtaining subsequences by region. SeqKit can convert FASTA/Q to and from tabular format, which can then be conveniently manipulated with other tabular format tools.

SeqKit outperformed seqtk in processing time when parsing FASTA files at both tested scales, while maintaining reasonable peak memory usage. It achieved approximately 85% of the speed of seqtk in FASTQ file parsing.
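
The contrast between a "map strategy" and a table lookup for reverse complementation, mentioned above, can be illustrated with a short Python sketch. This is not SeqKit's implementation (SeqKit is written in Go, and its data structure is not described here); the function names `revcomp_map` and `revcomp_table` are hypothetical, and the sketch only shows the general idea of replacing a per-base hash lookup with a precomputed table indexed by byte value.

```python
# Illustrative sketch only; NOT SeqKit's actual code or data structure.

# "Map strategy": one hash-map lookup per base.
COMPLEMENT_MAP = {
    "A": "T", "C": "G", "G": "C", "T": "A", "N": "N",
    "a": "t", "c": "g", "g": "c", "t": "a", "n": "n",
}


def revcomp_map(seq: str) -> str:
    """Reverse complement via a per-base dictionary lookup."""
    # Bases not in the map are passed through unchanged.
    return "".join(COMPLEMENT_MAP.get(base, base) for base in reversed(seq))


# "Table strategy": a 256-entry table indexed directly by byte value,
# built once and reused for every sequence.
COMPLEMENT_TABLE = bytes.maketrans(b"ACGTNacgtn", b"TGCANtgcan")


def revcomp_table(seq: bytes) -> bytes:
    """Reverse complement via a precomputed byte-indexed lookup table."""
    # translate() complements every byte; the slice reverses the result.
    return seq.translate(COMPLEMENT_TABLE)[::-1]


if __name__ == "__main__":
    print(revcomp_map("ACGTTGCAAN"))
    print(revcomp_table(b"ACGTTGCAAN").decode("ascii"))
```

Both functions return the same result for the same input; they differ only in how each complement is looked up. The ~20× figure quoted above refers to SeqKit's own implementation and should not be expected from this sketch.
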
For searching sequences by an ID list, SeqKit required far less time than all other software while keeping memory usage reasonable. When sampling by sequence number, seqtk and SeqKit showed similar computational speeds, whereas seqmagick used far more memory than either because it read the whole file into memory. For removing duplicate sequences by sequence content, SeqKit ran much faster than seqmagick and used less memory. When getting subsequences from BED files, SeqKit performed similarly to seqtk in speed but used more memory. SeqKit used more memory than seqtk in all cases, but its peak memory usage is determined by the length of the longest sequence record. Considering its efficiency in both time and memory, SeqKit can meet the need for efficient manipulation of large FASTA and FASTQ files as data sizes grow.
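
The points about automatic format detection, gzip support, and memory bounded by the longest record can be sketched as a streaming reader. This is a minimal Python illustration under stated assumptions, not SeqKit's parser: the names `open_maybe_gzip` and `read_records` are hypothetical, FASTQ records are assumed to span exactly four lines, and blank lines are skipped.

```python
import gzip
import io
from typing import BinaryIO, Iterable, Iterator, Optional, Tuple

# (name, sequence, quality); quality is None for FASTA records.
Record = Tuple[str, str, Optional[str]]


def open_maybe_gzip(path: str) -> BinaryIO:
    """Open a plain or gzip-compressed file, detected by the gzip magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")


def read_records(handle: Iterable[bytes]) -> Iterator[Record]:
    """Yield records one at a time; the format is taken from the first header.

    Only the current record is held in memory, so peak usage grows with the
    longest record rather than with the file size.
    """
    decoded = (raw.decode("ascii").rstrip("\r\n") for raw in handle)
    lines = (line for line in decoded if line)  # assumption: skip blank lines
    first = next(lines, None)
    if first is None:
        return  # empty input
    if first.startswith(">"):  # FASTA: sequence may span several lines
        name, parts = first[1:], []
        for line in lines:
            if line.startswith(">"):
                yield name, "".join(parts), None
                name, parts = line[1:], []
            else:
                parts.append(line)
        yield name, "".join(parts), None
    elif first.startswith("@"):  # FASTQ: assumed four lines per record
        header = first
        while header is not None:
            seq = next(lines, None)
            plus = next(lines, None)
            qual = next(lines, None)
            if qual is None or not plus.startswith("+"):
                raise ValueError("truncated or malformed FASTQ record")
            yield header[1:], seq, qual
            header = next(lines, None)
    else:
        raise ValueError("input is neither FASTA nor FASTQ")


if __name__ == "__main__":
    fasta = io.BytesIO(b">seq1 example\nACGT\nAC\n>seq2\nGG\n")
    for name, seq, qual in read_records(fasta):
        print(name, len(seq), qual)
```

A file on disk could be read the same way with `with open_maybe_gzip(path) as fh: for record in read_records(fh): ...`. Because the generator yields each record before reading the next, a caller that processes records one by one never holds the whole file in memory.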