2004 | Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M.B. Vitányi
The paper introduces a new class of distances, called the "normalized information distance" (NID), which is based on the noncomputable concept of Kolmogorov complexity. The NID is shown to be universal, meaning it minorizes every computable distance in the class and captures all computable similarities. It is demonstrated that the NID is a metric and is called the "similarity metric." The authors develop a practical analogue of the NID, called the "normalized compression distance" (NCD), which uses real-world compressors like gzip and GenCompress. The NCD is tested in various applications, including whole mitochondrial genome phylogeny and language tree construction. The NCD is shown to be effective in reconstructing phylogenetic trees from mitochondrial genomes and constructing language trees for over 50 Euro-Asian languages. The paper also discusses the theoretical foundations of the NID and NCD, including the properties of Kolmogorov complexity and the definition of normalized distances. The NCD is proven to be a quasi-universal similarity metric relative to a normal reference compressor, and the ComPlearn Toolkit is introduced as a software tool for applying compression techniques to pattern discovery and learning.The paper introduces a new class of distances, called the "normalized information distance" (NID), which is based on the noncomputable concept of Kolmogorov complexity. The NID is shown to be universal, meaning it minorizes every computable distance in the class and captures all computable similarities. It is demonstrated that the NID is a metric and is called the "similarity metric." The authors develop a practical analogue of the NID, called the "normalized compression distance" (NCD), which uses real-world compressors like gzip and GenCompress. The NCD is tested in various applications, including whole mitochondrial genome phylogeny and language tree construction. The NCD is shown to be effective in reconstructing phylogenetic trees from mitochondrial genomes and constructing language trees for over 50 Euro-Asian languages. The paper also discusses the theoretical foundations of the NID and NCD, including the properties of Kolmogorov complexity and the definition of normalized distances. The NCD is proven to be a quasi-universal similarity metric relative to a normal reference compressor, and the ComPlearn Toolkit is introduced as a software tool for applying compression techniques to pattern discovery and learning.