2017 | Antonio Fabregat, Konstantinos Sidiropoulos, Guilherme Viteri, Oscar Forner, Pablo Marín-García, Vicente Arnau, Peter D'Eustachio, Lincoln Stein, Henning Hermjakob
This paper presents a high-performance in-memory implementation of the over-representation analysis (ORA) method for pathway analysis in Reactome, a curated and peer-reviewed knowledge base of biomolecular pathways. The ORA method is divided into four steps, each using specific data structures to optimize performance and minimize memory usage:
1. **Identifier Search**: A radix tree is used to quickly check if user identifiers correspond to entities in Reactome.
2. **Entity Modeling**: A graph is used to model proteins, chemicals, their orthologs, and their composition in complexes and sets.
3. **Result Aggregation**: A double-linked tree is used to aggregate results and calculate statistics.
4. **Statistical Testing**: The statistical significance of each pathway association is calculated with the binomial test, and the resulting p-values are corrected for multiple testing using the Benjamini-Hochberg procedure.
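As a rough illustration of steps 1 and 4, the sketch below pairs an identifier lookup with the statistical test. It is not the Reactome implementation. The `PrefixTree` class is an uncompressed prefix tree standing in for the radix tree (a radix tree additionally merges chains of single-child nodes). The function names, the toy identifiers, and the choice of a one-sided test with `pathway_fraction` defined as the share of background entities annotated to a pathway are all assumptions made for this example.

```python
from math import comb


class PrefixTree:
    """Uncompressed prefix tree (trie); a simplified stand-in for a radix tree."""

    def __init__(self):
        self.children = {}
        self.entity = None  # hypothetical entity label stored where an identifier ends

    def insert(self, identifier, entity):
        node = self
        for ch in identifier:
            node = node.children.setdefault(ch, PrefixTree())
        node.entity = entity

    def find(self, identifier):
        """Return the stored entity, or None if the identifier is unknown."""
        node = self
        for ch in identifier:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.entity


def binomial_p_value(hits, sample_size, pathway_fraction):
    """One-sided P(X >= hits) for X ~ Binomial(sample_size, pathway_fraction)."""
    return sum(
        comb(sample_size, k)
        * pathway_fraction ** k
        * (1 - pathway_fraction) ** (sample_size - k)
        for k in range(hits, sample_size + 1)
    )


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy data: identifiers and entity labels are made up for illustration.
    tree = PrefixTree()
    tree.insert("ID0001", "entity-1")
    tree.insert("ID0002", "entity-2")
    submitted = ["ID0001", "ID9999"]
    found = [i for i in submitted if tree.find(i) is not None]
    print("matched identifiers:", found)

    # Hypothetical per-pathway inputs: (hits, fraction of background in pathway).
    pathways = [(5, 0.01), (2, 0.05), (1, 0.20)]
    raw = [binomial_p_value(h, 50, frac) for h, frac in pathways]
    print("adjusted p-values:", benjamini_hochberg(raw))
```

Summing the binomial tail directly is easy to follow but slow for genome-wide sample sizes; a statistics library's survival function would be the usual choice in practice.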
The implementation significantly improves the performance of Reactome's pathway analysis service, enabling the analysis of genome-wide datasets within seconds and supporting interactive exploration of high-throughput data. The approach is available via a web service and a user interface integrated into Reactome's Pathway Browser. The paper also compares Reactome's pathway analysis tools with those of other resources like GSEA, DAVID, PANTHER, and ConsensusPathDB, highlighting Reactome's strengths in performance, flexibility, and ease of integration.