Cooler: scalable storage for Hi-C data and other genomically-labeled arrays

February 21, 2019 | Nezar Abdennur and Leonid Mirny
The paper introduces a scalable storage solution for high-resolution, multidimensional genomic datasets, particularly those generated by Hi-C and similar technologies. The authors propose a sparse data model and an HDF5-based file format called "cooler." The format stores genomically-labeled matrices at any resolution and accommodates various data axes, resolutions, and metadata. It is flexible and efficient, and it supports both sequential and random access, which makes it suitable for out-of-core data processing algorithms. The paper also describes a Python library and command-line tools for creating, reading, and manipulating cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium, and the software can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub.
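
To make the idea of a sparse, genomically-labeled matrix with random access more concrete, here is a minimal pure-Python sketch. It is not the cooler library's implementation and does not touch HDF5. It assumes a bin table of (chrom, start, end) intervals, a pixel table of (bin1_id, bin2_id, count) records holding only the nonzero upper-triangle entries sorted by row, and a row-offset index for range queries. The function names (`make_bins`, `build_row_index`, `bin_range`, `fetch`) and the toy data are hypothetical, chosen for illustration only.

```python
from bisect import bisect_left


def make_bins(chromsizes, binsize):
    """Divide each chromosome into fixed-width bins: (chrom, start, end)."""
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


def build_row_index(pixels, n_bins):
    """offsets[i] is the position of the first pixel whose bin1_id >= i.

    Requires pixels to be sorted by bin1_id. The final entry equals
    len(pixels), so row i occupies pixels[offsets[i]:offsets[i + 1]].
    """
    rows = [p[0] for p in pixels]
    return [bisect_left(rows, i) for i in range(n_bins + 1)]


def bin_range(bins, chrom, start, end):
    """Half-open range of bin ids overlapping the interval chrom:start-end."""
    ids = [i for i, (c, s, e) in enumerate(bins)
           if c == chrom and s < end and e > start]
    return ids[0], ids[-1] + 1


def fetch(pixels, offsets, lo, hi):
    """Dense symmetric submatrix for bins [lo, hi), built from sparse pixels."""
    n = hi - lo
    dense = [[0] * n for _ in range(n)]
    for i in range(lo, hi):
        # Random access: jump straight to the pixels of row i.
        for b1, b2, count in pixels[offsets[i]:offsets[i + 1]]:
            if lo <= b2 < hi:
                dense[b1 - lo][b2 - lo] = count
                dense[b2 - lo][b1 - lo] = count  # mirror the upper triangle
    return dense


# Toy genome: two chromosomes binned at a resolution of 100 bp (made-up sizes).
bins = make_bins({"chr1": 250, "chr2": 120}, binsize=100)

# Nonzero upper-triangle entries only, sorted by (bin1_id, bin2_id).
pixels = [
    (0, 0, 4), (0, 1, 2), (0, 3, 1),
    (1, 1, 5), (1, 2, 3),
    (2, 4, 1),
    (3, 3, 6), (3, 4, 2),
]
offsets = build_row_index(pixels, len(bins))

lo, hi = bin_range(bins, "chr1", 0, 250)
chr1_matrix = fetch(pixels, offsets, lo, hi)
```

Only the nonzero pixels are stored, and the offset index lets a region query read just the rows it needs instead of scanning the whole table, which is the property that makes out-of-core processing practical. In the actual Python library, comparable region queries go through a `cooler.Cooler` object; consult its documentation for the exact interface.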