Clustering by Compression

Clustering by Compression

9 Apr 2004 | Rudi Cilibrasi*, Paul Vitanyi†
The paper introduces a novel method for clustering data based on compression, which does not rely on subject-specific features or background knowledge. The method uses the normalized compression distance (NCD), a similarity metric derived from the lengths of compressed data files. The NCD is universal and can be applied across different application areas. The paper also presents a hierarchical clustering method using a new quartet method and a fast heuristic to implement it. The method is implemented as open-source software and has been successfully applied in various fields, including genomics, virology, languages, literature, music, handwritten digits, and astronomy. The authors demonstrate the universality and robustness of the method through experiments and provide evidence of its effectiveness in different domains. The NCD is quasi-universal, meaning it minorizes every computable similarity metric up to an additive error term that depends on the quality of the compressor's approximation of Kolmogorov complexity. The paper also discusses the theoretical foundations of the NCD, including the properties of normal compressors and the quasi-universality of the NCD.The paper introduces a novel method for clustering data based on compression, which does not rely on subject-specific features or background knowledge. The method uses the normalized compression distance (NCD), a similarity metric derived from the lengths of compressed data files. The NCD is universal and can be applied across different application areas. The paper also presents a hierarchical clustering method using a new quartet method and a fast heuristic to implement it. The method is implemented as open-source software and has been successfully applied in various fields, including genomics, virology, languages, literature, music, handwritten digits, and astronomy. The authors demonstrate the universality and robustness of the method through experiments and provide evidence of its effectiveness in different domains. The NCD is quasi-universal, meaning it minorizes every computable similarity metric up to an additive error term that depends on the quality of the compressor's approximation of Kolmogorov complexity. The paper also discusses the theoretical foundations of the NCD, including the properties of normal compressors and the quasi-universality of the NCD.
Reach us at info@study.space