Clustering by Compression introduces a clustering method built on a universal similarity distance, the normalized compression distance (NCD), which is computed from the lengths of compressed data files. The method uses no subject-specific features or background knowledge and is robust across application areas. The NCD is defined in terms of a normal compressor: a lossless encoder that satisfies certain properties such as symmetry and monotonicity. The underlying theory is the normalized information distance, which is provably optimal in the sense that it minorizes every computable normalized metric satisfying a certain density requirement. That optimality comes at the price of relying on Kolmogorov complexity, which is non-computable. The NCD substitutes a real compressor and is shown to be a similarity metric that approximates optimality: it is quasi-universal in the sense that it minorizes every computable similarity metric up to an additive error term. In practice, the NCD is a non-negative number roughly between 0 and 1 (imperfections in a real compressor can push it slightly above 1) that represents how different two files are; smaller values mean more similar files. Clustering is hierarchical, based on a new quartet method and a fast heuristic to implement it. The method has been implemented as public software and applied successfully in genomics (including whole-genome phylogeny), virology, languages (language trees), literature, music, handwritten digits, and astronomy. Across these applications the NCD is robust under different compressors and captures the dominant similarities between objects.
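As a rough illustration of how a distance can be derived from compressed lengths, the NCD of two strings x and y is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length and xy is the concatenation. The sketch below is not the authors' public software: it assumes Python's standard `zlib` as a stand-in compressor, and the names `compressed_len`, `ncd`, and `distance_matrix` are hypothetical helpers chosen for this example. It only produces the pairwise distances; the quartet-based hierarchical clustering step is not shown.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values, the input a clustering step would consume."""
    n = len(items)
    return [[ncd(items[i], items[j]) for j in range(n)] for i in range(n)]
```

For example, `ncd(a, a)` for a long repetitive text `a` should come out much smaller than `ncd(a, c)` for an unrelated text `c`. Because zlib is only approximately a normal compressor, the matrix may be slightly asymmetric and its diagonal slightly above 0.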

Clustering by Compression

9 Apr 2004 | Rudi Cilibrasi, Paul Vitanyi