February 1, 2021 | Gennady Korotkevich, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N. Artyomov, Alexey Sergushichev
The paper introduces FGSEA (Fast Gene Set Enrichment Analysis), a method for efficiently and accurately estimating gene set enrichment analysis (GSEA) P-values. Traditional GSEA implementations struggle with small P-values because estimating them empirically requires a large number of random gene sets, which is time-consuming and resource-intensive. FGSEA addresses this with two procedures: FGSEA-simple and FGSEA-multilevel.
1. **FGSEA-simple**: This method efficiently estimates P-values for a whole collection of gene sets by reusing the same random gene set samples across different pathways. It calculates enrichment scores for the prefixes of each random gene set, updating the most distant point as genes are added, with an amortized time complexity of \(O(K\sqrt{K})\), which significantly reduces the computational time compared to naive approaches.
2. **FGSEA-multilevel**: This method is designed to estimate very low P-values for individual gene sets using an adaptive multi-level split Monte Carlo scheme. It involves dividing the enrichment scores into levels and estimating the probability of a random gene set having an enrichment score at least as high as the query pathway's score. The method uses the Metropolis algorithm to sample from conditional distributions, providing accurate estimates even for very small P-values.
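
The sample-reuse idea behind FGSEA-simple can be illustrated with a minimal Python sketch. This is not the package's implementation: the function names `enrichment_score` and `simple_pvalues` are hypothetical, each prefix score is recomputed naively rather than with the \(O(K\sqrt{K})\) update, and the P-value is a plain one-sided empirical estimate for positive enrichment, which is an assumption made for brevity. What it does show is that one random sample of the largest size yields, through its prefixes, a random gene set for every smaller pathway size.

```python
import numpy as np


def enrichment_score(stats, members):
    """Weighted GSEA enrichment score (weight exponent 1).

    `stats` must be sorted in decreasing order; `members` holds the
    positions of the gene set within that ranking. The set must be a
    proper subset of the ranked genes with a nonzero total weight.
    """
    n = len(stats)
    hit = np.zeros(n, dtype=bool)
    hit[members] = True
    weights = np.where(hit, np.abs(stats), 0.0)
    step_hit = weights / weights.sum()              # steps up on set members
    step_miss = np.where(hit, 0.0, 1.0 / (n - hit.sum()))  # steps down otherwise
    running = np.cumsum(step_hit - step_miss)
    # The score is the running-sum value farthest from zero (sign kept).
    return running[np.argmax(np.abs(running))]


def simple_pvalues(stats, pathways, n_perm=1000, seed=0):
    """Empirical P-values for many pathways from shared random samples."""
    rng = np.random.default_rng(seed)
    sizes = sorted({len(m) for m in pathways.values()})
    k_max = sizes[-1]
    null = {k: np.empty(n_perm) for k in sizes}
    for i in range(n_perm):
        # One random ordering of k_max genes; its prefix of length k is a
        # uniformly random gene set of size k, so it serves every size.
        sample = rng.permutation(len(stats))[:k_max]
        for k in sizes:
            null[k][i] = enrichment_score(stats, sample[:k])
    result = {}
    for name, members in pathways.items():
        es = enrichment_score(stats, np.asarray(members))
        exceed = np.sum(null[len(members)] >= es)
        # Add-one estimate, so the P-value is never exactly zero.
        result[name] = (es, (exceed + 1) / (n_perm + 1))
    return result
```

With `n_perm` samples, the smallest P-value this estimator can report is `1 / (n_perm + 1)`, which is the limitation that motivates the multilevel procedure.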
The authors validate FGSEA against an exact algorithm for GSEA P-value calculation that applies to integer gene-level statistics, and demonstrate that FGSEA can estimate P-values as low as \(10^{-100}\) with a small and predictable error. They also show that FGSEA recovers more statistically significant pathways than other implementations on a collection of 605 datasets. The method is open-source and available as an R package in Bioconductor and on GitHub.
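
To give a sense of scale for the \(10^{-100}\) figure, consider a simplified reading of the multilevel scheme in which each level \(\ell_i\) is placed so that roughly half of the gene sets sampled at the previous level exceed it. This halving rule is an assumption made for illustration; the summary does not state how the levels are chosen. With \(\ell_0 = -\infty\) and \(\ell_m = s\), the enrichment score of the query pathway, the P-value factors into conditional probabilities:

\[
P(\mathrm{ES} \ge s) = \prod_{i=1}^{m} P(\mathrm{ES} \ge \ell_i \mid \mathrm{ES} \ge \ell_{i-1}) \approx 2^{-m},
\qquad
m \approx \log_2 10^{100} = 100 \log_2 10 \approx 332 .
\]

Under that assumption, a few hundred levels would be enough to reach \(10^{-100}\), whereas direct sampling would need on the order of \(10^{100}\) random gene sets to observe even one score as extreme as \(s\).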