February 21, 2019 | Nezar Abdennur and Leonid Mirny
The paper introduces a scalable storage solution for high-resolution, multidimensional genomic datasets, particularly those generated by Hi-C and similar technologies. The authors propose a sparse data model and a file format called "cooler," which is based on HDF5. This format supports genomically-labeled matrices at any resolution, accommodating various data axes, resolutions, and metadata. The cooler format is flexible, efficient, and supports both sequential and random access, making it suitable for out-of-core data processing algorithms. The paper also describes a Python library and command-line tools for creating, reading, and manipulating cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium and is available for installation from the Python Package Index or bioconda repository. The source code is maintained on GitHub.
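To make the idea of a sparse, genomically-labeled matrix more concrete, here is a minimal sketch in plain Python. It does not use the cooler library or HDF5. It assumes a simple layout with a table of fixed-width genomic bins and a sorted table of nonzero upper-triangle pixels. The function names (`make_bins`, `bin_id`, `make_pixels`, `fetch_rows`) and the toy chromosome sizes and contacts are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left
from collections import Counter


def make_bins(chrom_sizes, binsize):
    """Split each chromosome into fixed-width bins (the last bin may be shorter)."""
    bins = []
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


def bin_id(bins, chrom, pos):
    """Return the index of the bin containing (chrom, pos), by linear scan."""
    for i, (c, start, end) in enumerate(bins):
        if c == chrom and start <= pos < end:
            return i
    raise KeyError((chrom, pos))


def make_pixels(bins, contacts):
    """Aggregate contact pairs into sorted (bin1_id, bin2_id, count) records.

    Only nonzero entries are kept, and each pair is stored once with
    bin1_id <= bin2_id, so a symmetric matrix is held as its upper triangle.
    """
    counts = Counter()
    for (c1, p1), (c2, p2) in contacts:
        i, j = bin_id(bins, c1, p1), bin_id(bins, c2, p2)
        counts[(min(i, j), max(i, j))] += 1
    return [(i, j, n) for (i, j), n in sorted(counts.items())]


def fetch_rows(pixels, lo, hi):
    """Random access: return the pixels whose first bin ID lies in [lo, hi)."""
    # Assumption: rebuilding the key list keeps the sketch short; a real
    # on-disk format would keep a precomputed offset index instead.
    keys = [p[0] for p in pixels]
    return pixels[bisect_left(keys, lo):bisect_left(keys, hi)]


# Toy input: two small "chromosomes" and four read-pair contacts.
chrom_sizes = {"chr1": 250, "chr2": 120}
bins = make_bins(chrom_sizes, 100)
contacts = [
    (("chr1", 10), ("chr1", 180)),
    (("chr1", 150), ("chr1", 20)),
    (("chr1", 220), ("chr2", 5)),
    (("chr2", 110), ("chr2", 30)),
]
pixels = make_pixels(bins, contacts)

print(bins)
print(pixels)
print(fetch_rows(pixels, 2, 4))
```

Because the pixel records are sorted by their first bin ID, a contiguous range of matrix rows maps to a contiguous slice of the pixel table. That property is what lets a layout like this serve both sequential scans and random-access range queries without loading the whole matrix into memory.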