2017 April | Rob Patro¹,*, Geet Duggal²,†, Michael I Love³,†, Rafael A Irizarry³,§, and Carl Kingsford¹,⁴
Salmon is a fast, bias-aware method for quantifying transcript expression from RNA-seq reads. It improves accuracy by correcting for fragment GC content bias, which in turn makes differential expression analysis more reliable. Salmon combines a dual-phase inference algorithm with feature-rich bias models and an ultra-fast read mapping procedure. It outperforms existing tools such as kallisto and eXpress in accuracy and speed, using sample-specific bias models to account for sequence-specific, fragment-GC, and positional biases. The two-phase inference lets Salmon build a probabilistic model of the sequencing experiment that incorporates information other methods do not consider, and it also estimates the uncertainty in abundance that arises from random sampling and multimapping reads. The lightweight mapping procedure tracks fragment positions and orientations, which are used to compute per-fragment conditional probabilities. Together, dual-phase inference and the bias models improve inter-replicate concordance relative to other methods. In differential expression testing, Salmon achieves higher sensitivity at the same false discovery rate. These benefits persist at the gene level, where Salmon reduces the number of genes spuriously called as differentially expressed. Salmon is unique in combining models of the experimental data with efficient dual-phase inference.
Salmon is open-source and freely licensed (GPLv3), written in C++11, and available at https://github.com/COMBINE-lab/Salmon. Its model accounts for sample-specific parameters and biases typical of RNA-seq data, including positional biases, sequence-specific biases at the 5' and 3' ends of fragments, fragment-level GC bias, strand-specific protocols, and the fragment length distribution. The approach is efficient and scalable: it can quantify 600 million reads in 23 minutes using 30 threads.
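To make the idea of fragment GC bias concrete, the following is a minimal Python sketch, not Salmon's implementation. It compares the GC-content histogram of observed fragments with the histogram a model would expect and reports a per-bin observed/expected ratio. The function names, the number of bins, and the pseudocount are all assumptions chosen for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a fragment sequence (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_histogram(fragments, n_bins=10):
    """Normalized histogram of fragment GC fractions over equal-width bins."""
    counts = [0] * n_bins
    for frag in fragments:
        # Clamp so that a GC fraction of exactly 1.0 lands in the last bin.
        b = min(int(gc_fraction(frag) * n_bins), n_bins - 1)
        counts[b] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


def gc_bias_weights(observed, expected, pseudo=1e-3):
    """Per-bin observed/expected ratio.

    A ratio above 1 means fragments in that GC bin are over-represented
    relative to the model's expectation. The pseudocount (an assumption of
    this sketch) avoids division by zero in empty bins.
    """
    return [(o + pseudo) / (e + pseudo) for o, e in zip(observed, expected)]
```

In this sketch, `observed` would come from the mapped fragments and `expected` from fragments drawn according to the current abundance estimates; how Salmon actually learns and applies these quantities is more involved than shown here.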
By handling technical biases, Salmon enables accurate interpretation of expression experiments in the context of large sequence databases. It is designed to take advantage of multiple CPU cores and is suitable for large-scale RNA-seq data analysis. Inference proceeds in two phases that together optimize the transcript abundance estimates: the online phase uses a variant of stochastic collapsed variational Bayesian inference, and the offline phase applies either EM or variational Bayesian EM. The bias models account for sequence-specific and fragment GC biases, the effective length model accounts for the fragment size distribution and transcript length, and the alignment model is a spatially varying first-order Markov model over CIGAR symbols and nucleotides. Equivalence classes reduce computational cost by grouping fragments that align to the same set of transcripts. In validation, Salmon is more accurate than other methods on metrics such as mean absolute relative difference (MARD). This combination of bias correction and accurate abundance estimation makes Salmon a valuable tool for RNA-seq analysis.
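The equivalence-class idea and the offline EM phase described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Salmon's algorithm: every transcript in a class is weighted equally (no effective lengths, bias terms, or per-fragment conditional probabilities), and the names `build_equivalence_classes` and `em_abundances` are hypothetical.

```python
from collections import Counter


def build_equivalence_classes(fragment_mappings):
    """Collapse fragments into classes keyed by the set of transcripts they map to.

    fragment_mappings: iterable of iterables of transcript names, one per fragment.
    Returns a Counter mapping frozenset(transcripts) -> number of fragments.
    """
    return Counter(frozenset(m) for m in fragment_mappings if m)


def em_abundances(eq_classes, transcripts, n_iter=100):
    """Plain EM over equivalence classes with uniform within-class weights.

    Returns the estimated number of fragments assigned to each transcript.
    The cost of one iteration depends on the number of classes, not on the
    number of fragments, which is the point of the grouping.
    """
    total = sum(eq_classes.values())
    alpha = {t: total / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        new_alpha = {t: 0.0 for t in transcripts}
        for members, count in eq_classes.items():
            denom = sum(alpha[t] for t in members)
            if denom == 0.0:
                continue
            # Split this class's fragments in proportion to current abundances.
            for t in members:
                new_alpha[t] += count * alpha[t] / denom
        alpha = new_alpha
    return alpha
```

For example, with three fragments mapping only to transcript A, one only to B, and four to both, the eight fragments collapse into three classes, and the estimates converge to 6 fragments for A and 2 for B.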
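The effective length model mentioned above can be illustrated with one common formulation: the transcript length minus the mean fragment length, where the mean is taken only over fragment lengths that fit within the transcript. The sketch below assumes that formulation; the function name, the dictionary representation of the fragment length distribution, and the fallback and floor values are choices made for illustration.

```python
def effective_length(transcript_len, frag_len_probs):
    """Transcript length minus the mean of the truncated fragment length distribution.

    frag_len_probs: dict mapping fragment length -> probability (or count).
    Only fragment lengths no longer than the transcript contribute to the mean.
    """
    fitting = {f: p for f, p in frag_len_probs.items()
               if f <= transcript_len and p > 0}
    mass = sum(fitting.values())
    if mass == 0.0:
        # Assumption of this sketch: if no observed fragment length fits,
        # fall back to the raw transcript length.
        return float(transcript_len)
    mean_fitting = sum(f * p for f, p in fitting.items()) / mass
    # Floor at 1.0 (another assumption) so the result is always positive.
    return max(transcript_len - mean_fitting, 1.0)
```

With a distribution placing probabilities 0.25, 0.5, and 0.25 on fragment lengths 100, 200, and 300, a 1000-base transcript has effective length 800, while a 250-base transcript uses only the lengths 100 and 200 when computing the mean.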
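As a worked illustration of the MARD metric named above, the sketch below assumes the symmetric definition in which each transcript's absolute difference is divided by the mean of the true and estimated values, with the difference defined as 0 when both are 0. The exact normalization used in a given study should be checked against its methods; the function names are hypothetical.

```python
def absolute_relative_difference(truth, estimate):
    """Symmetric relative difference; 0 when both values are 0, at most 2 for non-negative inputs."""
    if truth == 0 and estimate == 0:
        return 0.0
    return abs(truth - estimate) / (0.5 * abs(truth + estimate))


def mard(truths, estimates):
    """Mean of the per-transcript absolute relative differences (lower is better)."""
    if len(truths) != len(estimates):
        raise ValueError("truths and estimates must have the same length")
    if not truths:
        raise ValueError("at least one transcript is required")
    ards = [absolute_relative_difference(t, e) for t, e in zip(truths, estimates)]
    return sum(ards) / len(ards)
```

Under this definition, a transcript that is truly expressed but estimated at zero (or the reverse) contributes the maximum value of 2, so the metric penalizes both false positives and false negatives.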