2018 | F. Alexander Wolf, Philipp Angerer and Fabian J. Theis
SCANPY is a scalable toolkit for analyzing single-cell gene expression data. It offers methods for preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks. It is implemented in Python and efficiently handles datasets with more than one million cells. Alongside SCANPY, the authors introduce ANNDATA, a class for handling annotated data matrices. SCANPY integrates established R-based methods in a scalable and modular way: preprocessing similar to SEURAT and CELL RANGER, visualization via t-SNE and diffusion maps, clustering similar to PHENOGRAPH, and pseudotemporal ordering via diffusion pseudotime. Benchmarks against existing packages show significant speedups on large datasets, to the point that datasets with more than one million cells can be analyzed interactively. The modular implementation also makes it straightforward to combine SCANPY with advanced machine learning packages from the Python ecosystem. SCANPY is built around the ANNDATA class, which supports sparse data and HDF5-based storage, so data can be processed without loading the entire dataset into memory. It also includes a graph class for neighborhood relations, which improves computational efficiency. SCANPY is available on GitHub and the Python Package Index, with extensive documentation and examples. It addresses the growing need to analyze large datasets across different experimental setups, and it is designed to be extended and maintained by a community. It supports data exchange across labs through the loom file format. A comparison with other Python packages for single-cell analysis shows its advantages in scalability and functionality.
It is available under the BSD3 license and is suitable for Linux, Mac OS, and Windows.
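To make the idea of an annotated data matrix more concrete, the following is a minimal sketch in plain Python. It is not the ANNDATA implementation or its API: the class name `MiniAnnotatedMatrix`, the method `subset_obs`, and the toy values are all hypothetical and chosen only for illustration. The sketch assumes the core idea is a cells-by-genes matrix that carries per-cell and per-gene annotations which stay aligned when the data is subset.

```python
from dataclasses import dataclass, field


@dataclass
class MiniAnnotatedMatrix:
    """Toy stand-in for an annotated data matrix (hypothetical, not the anndata API)."""

    X: list                                   # n_obs rows, each a list of n_vars values
    obs: dict = field(default_factory=dict)   # per-cell annotations: name -> n_obs values
    var: dict = field(default_factory=dict)   # per-gene annotations: name -> n_vars values

    def __post_init__(self):
        # Reject annotations that do not line up with the matrix dimensions.
        n_obs, n_vars = self.shape
        if any(len(row) != n_vars for row in self.X):
            raise ValueError("all rows of X must have the same length")
        for name, values in self.obs.items():
            if len(values) != n_obs:
                raise ValueError(f"obs[{name!r}] must have one entry per cell")
        for name, values in self.var.items():
            if len(values) != n_vars:
                raise ValueError(f"var[{name!r}] must have one entry per gene")

    @property
    def shape(self):
        return (len(self.X), len(self.X[0]) if self.X else 0)

    def subset_obs(self, keep):
        """Return a new object holding only the cells at the indices in `keep`."""
        return MiniAnnotatedMatrix(
            X=[list(self.X[i]) for i in keep],
            obs={k: [v[i] for i in keep] for k, v in self.obs.items()},
            var={k: list(v) for k, v in self.var.items()},
        )


# Four cells, three genes, with one annotation per cell and one per gene.
adata = MiniAnnotatedMatrix(
    X=[[0, 3, 1], [2, 0, 0], [5, 1, 0], [0, 0, 4]],
    obs={"cluster": ["a", "b", "a", "b"]},
    var={"gene_name": ["g1", "g2", "g3"]},
)

# Select the cells of cluster "a"; matrix rows and annotations are sliced together.
cluster_a = adata.subset_obs(
    [i for i, c in enumerate(adata.obs["cluster"]) if c == "a"]
)
print(cluster_a.shape)
```

The sketch keeps everything in memory as dense lists. The sparse data support and HDF5-based storage described in the summary are deliberately left out.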