[slides] The similarity metric

The paper introduces a new class of similarity distances for comparing sequences, centered on the "normalized information distance" (NID), a universal similarity measure based on Kolmogorov complexity. The NID is shown to be a metric and to minorize every computable distance in the class, which makes it universal: it discovers every computable similarity. Because Kolmogorov complexity is noncomputable, the paper develops a practical approximation, the "normalized compression distance" (NCD), using real-world compressors such as gzip and GenCompress. The NCD is tested in applications including constructing phylogenetic trees from whole mitochondrial genomes and a language tree for 52 languages. The NCD is shown to be a similarity distance that satisfies the metric properties and approximates universality. The paper also discusses the theoretical foundations of the NID, its properties, and its applications in bioinformatics and linguistics. The NCD is demonstrated to be effective in comparing sequences across different domains, including genomes, languages, and music. The paper concludes that the NID and NCD provide a robust and general framework for measuring similarity between sequences.
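
To make the compression-based distance concrete, here is a minimal sketch in Python. It assumes the commonly cited form NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C(.) is the compressed length in bytes. Python's zlib (DEFLATE, the algorithm behind gzip) stands in for the compressors named in the summary; the function names and sample strings are illustrative assumptions, not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes; a computable stand-in for Kolmogorov complexity.

    Assumption: zlib at its highest level is used here in place of the
    compressors named in the summary (gzip, GenCompress).
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs (hypothetical, chosen only to show the behaviour).
text_a = b"All human beings are born free and equal in dignity and rights. " * 8
text_b = b"All human beings are born free and equal in dignity and rights! " * 8
unrelated = bytes(range(256)) * 2

print("near-identical texts:", ncd(text_a, text_b))
print("unrelated data:      ", ncd(text_a, unrelated))
```

Near-identical inputs should score close to 0 and unrelated inputs close to 1. With a real compressor the value can slightly exceed 1, and the metric properties hold only approximately. A general-purpose compressor with a small window, such as DEFLATE, is also a weak fit for long inputs like whole genomes.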

The Similarity Metric

2004 | Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M.B. Vitányi