This preprint introduces 'bridge integration', a method for harmonizing single-cell datasets across modalities by using a multi-omic dataset as a molecular bridge. The approach uses dictionary learning to reconstruct unimodal datasets and transform them into a shared space, enabling accurate integration of transcriptomic data with measurements of chromatin accessibility, histone modifications, DNA methylation, and protein levels. This broadens the utility of single-cell reference datasets and facilitates comparisons across diverse molecular modalities.

The study demonstrates bridge integration by mapping scATAC-seq data onto scRNA-seq references, identifying rare and high-resolution subpopulations, and revealing cross-modality relationships. It also shows that bridge integration robustly handles cases where the query dataset represents only a subset of the reference. Quantitative benchmarking analyses show that the method outperforms other methods in accuracy and computational efficiency.

The method also combines dictionary learning with sketching techniques to improve computational scalability, allowing the integration of 8.6 million human immune cell profiles from sequencing and mass cytometry experiments. The resulting approach, 'atomic sketch integration', integrates large compendiums of datasets without requiring intensive computation on the full set of cells. It is demonstrated on human lung scRNA-seq data, where it enables the identification of ultra-rare populations and improves the identification of differentially expressed cell-type markers. It is further applied to the integration of scRNA-seq and CyTOF data, revealing cross-modality insights and enabling the annotation of rare cell populations.

The study concludes that dictionary learning enhances both the scalability of integration and the ability to integrate and compare diverse molecular modalities. The approach is applicable to a wide variety of technologies and modalities, and has the potential to be valuable for both individual labs and larger consortia. The method is implemented as part of the Seurat R package and is freely available as open-source software.
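
The core idea of bridge integration, representing cells from each unimodal dataset in terms of the cells of a multi-omic bridge, can be illustrated with a toy NumPy sketch. This is not the Seurat implementation: the noiseless linear model, the simulated "RNA" and "ATAC" features, the minimum-norm least-squares fit, and all function and variable names here are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TYPES = 3  # hypothetical number of cell types in the toy data


def simulate_latent(n_cells, jitter=0.1):
    """Return labels and latent states clustered around one center per type."""
    labels = np.arange(n_cells) % N_TYPES          # every type is represented
    centers = 5.0 * np.eye(N_TYPES)
    noise = jitter * rng.normal(size=(n_cells, N_TYPES))
    return labels, centers[labels] + noise


def dictionary_codes(cells, atoms, rcond=1e-10):
    """Minimum-norm coefficients C such that C @ atoms approximates cells.

    Each row of C expresses one cell as a linear combination of the atoms
    (here, the bridge cells measured in the same modality).
    """
    return cells @ np.linalg.pinv(atoms, rcond=rcond)


# Assumed linear maps from latent state to each modality's features.
A_rna = rng.normal(size=(N_TYPES, 50))    # 50 toy "genes"
A_atac = rng.normal(size=(N_TYPES, 200))  # 200 toy "peaks"

# Bridge: the SAME cells measured in both modalities.
_, Z_bridge = simulate_latent(30)
B_rna, B_atac = Z_bridge @ A_rna, Z_bridge @ A_atac

# Unimodal datasets: an RNA-only reference and an ATAC-only query.
labels_ref, Z_ref = simulate_latent(60)
labels_query, Z_query = simulate_latent(20)
X_ref = Z_ref @ A_rna
Y_query = Z_query @ A_atac

# Both datasets are re-expressed as weights over the 30 bridge cells,
# so they now live in one shared 30-dimensional space.
W_ref = dictionary_codes(X_ref, B_rna)
W_query = dictionary_codes(Y_query, B_atac)

# Transfer labels from the nearest reference cell in the shared space.
dists = ((W_query[:, None, :] - W_ref[None, :, :]) ** 2).sum(axis=-1)
predicted = labels_ref[dists.argmin(axis=1)]
print("label transfer accuracy:", (predicted == labels_query).mean())
```

In this toy setting the two modalities share no features, yet query cells can be compared directly with reference cells because both are described by coefficients over the same bridge cells. Real data are noisy and nonlinear, so any practical method needs more than this sketch shows.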

Dictionary learning for integrative, multimodal, and scalable single-cell analysis

February 26, 2022 | Yuhan Hao, Tim Stuart, Madeline Kowalski, Saket Choudhary, Paul Hoffman, Austin Hartman, Avi Srivastava, Gesmira Molla, Shaista Madad, Carlos Fernandez-Granda, Rahul Satija