2019 December; 16(12): 1289–1296 | Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yury Baglaenko, Michael Brenner, Po-ru Loh, Soumya Raychaudhuri
The paper introduces Harmony, an algorithm for integrating single-cell RNA-seq datasets. It addresses four challenges: scaling to large datasets, identifying both broad and fine-grained cell populations, accommodating complex experimental designs, and integrating across modalities. Harmony projects cells into a shared embedding in which they group by cell type rather than by dataset-specific conditions, while simultaneously accounting for multiple experimental and biological factors.
The authors report that Harmony outperforms other algorithms in six analyses, including cell lines, PBMCs from different protocols, pancreatic islet cells from multiple donors, mouse embryogenesis datasets, and cross-modality spatial integration. Harmony is computationally efficient: it requires significantly less memory and time than other methods and can integrate large datasets (up to 10^6 cells) on personal computers. The algorithm combines a novel soft k-means clustering method, which maximizes the diversity of datasets within each cluster, with a mixture model-based linear batch correction that removes batch effects. Harmony is available as an R package and can be applied to a variety of single-cell datasets, enabling the identification of rare cell subtypes and the integration of time-course developmental trajectories and spatially resolved datasets.
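
The two steps named above, diversity-aware soft clustering followed by a per-cluster linear correction, can be illustrated with a small sketch. This is not the Harmony implementation, only a toy version under stated assumptions: the function names `soft_assign` and `correct` and the parameters `sigma` and `theta` are hypothetical, distances are squared Euclidean, centroids are fixed rather than re-estimated, the `+ 1.0` smoothing in the penalty is an illustrative choice, and the correction subtracts weighted batch-mean offsets instead of fitting the mixture model described in the paper.

```python
import math

def soft_assign(cells, batches, centroids, sigma=1.0, theta=1.0, n_rounds=3):
    """Soft k-means responsibilities, reweighted to favour batch-diverse clusters."""
    n, k = len(cells), len(centroids)
    batch_frac = {b: batches.count(b) / n for b in set(batches)}

    # Distance kernel of ordinary soft k-means (squared Euclidean: an assumption).
    kernel = [[math.exp(-sum((x - c) ** 2 for x, c in zip(cell, cen)) / sigma)
               for cen in centroids] for cell in cells]
    resp = [[w / sum(row) for w in row] for row in kernel]

    # Soft cluster sizes and soft per-batch counts under the current assignment.
    size = [sum(r[j] for r in resp) for j in range(k)]
    observed = [{b: sum(resp[i][j] for i in range(n) if batches[i] == b)
                 for b in batch_frac} for j in range(k)]

    for _ in range(n_rounds):
        for i in range(n):
            b = batches[i]
            for j in range(k):  # take cell i out of the running counts
                size[j] -= resp[i][j]
                observed[j][b] -= resp[i][j]
            row = []
            for j in range(k):
                expected = size[j] * batch_frac[b]
                # Down-weight clusters where this cell's batch is over-represented.
                penalty = ((expected + 1.0) / (observed[j][b] + 1.0)) ** theta
                row.append(kernel[i][j] * penalty)
            total = sum(row)
            resp[i] = [w / total for w in row]
            for j in range(k):  # put cell i back with its new weights
                size[j] += resp[i][j]
                observed[j][b] += resp[i][j]
    return resp

def correct(cells, batches, resp):
    """Subtract, per cluster, each batch's weighted offset from the cluster mean."""
    n, k, d = len(cells), len(resp[0]), len(cells[0])

    def weighted_mean(idx, j):
        w = sum(resp[i][j] for i in idx)
        if w < 1e-12:
            return None  # this batch is effectively absent from cluster j
        return [sum(resp[i][j] * cells[i][a] for i in idx) / w for a in range(d)]

    corrected = [list(c) for c in cells]
    for j in range(k):
        overall = weighted_mean(range(n), j)
        for b in set(batches):
            idx = [i for i in range(n) if batches[i] == b]
            batch_mean = weighted_mean(idx, j)
            if overall is None or batch_mean is None:
                continue
            for i in idx:
                for a in range(d):
                    corrected[i][a] -= resp[i][j] * (batch_mean[a] - overall[a])
    return corrected

# Toy data: two cell types (near x=0 and x=10); batch B is shifted by +1 on axis 2.
cells = [[0.0, 0.0], [0.2, 0.1], [10.0, 0.0], [10.2, -0.1],
         [0.0, 1.0], [0.2, 1.1], [10.0, 1.0], [10.2, 0.9]]
batches = ["A"] * 4 + ["B"] * 4
centroids = [[0.0, 0.5], [10.0, 0.5]]

resp = soft_assign(cells, batches, centroids)
corrected = correct(cells, batches, resp)
```

In this toy setting the batch shift along the second axis is removed while the separation between the two cell types along the first axis is preserved. For real analyses the R package mentioned above is the appropriate tool; the sketch only shows the shape of the two steps.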