The problem of comparing two different partitions of a finite set of objects is a recurring issue in the clustering literature. This paper reviews a well-known measure of partition correspondence, the Rand index, and discusses the issue of correcting this index for chance. It notes that a recent normalization strategy is based on an incorrect assumption. The general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. These matrices are generated from corresponding partitions using various scoring rules. Special cases include traditional statistics and ones tailored to weight certain object pairs differently. The paper proposes a measure based on the comparison of object triples, which has a probabilistic interpretation, is corrected for chance, and is bounded between ±1.
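As a rough illustration of the cross-product idea, the sketch below scores every object pair in each partition and sums the products of corresponding scores. The function names and the two scoring rules shown (0/1 and +1/-1) are assumptions chosen for illustration; they are not necessarily the rules used in the paper.

```python
from itertools import combinations


def proximity(labels, same=1.0, different=0.0):
    """Score each object pair of one partition (illustrative scoring rule)."""
    return {
        (i, j): same if labels[i] == labels[j] else different
        for i, j in combinations(range(len(labels)), 2)
    }


def cross_product(labels_a, labels_b, same=1.0, different=0.0):
    """Raw cross-product of the two pair-score matrices, over pairs i < j."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same objects")
    prox_a = proximity(labels_a, same, different)
    prox_b = proximity(labels_b, same, different)
    return sum(prox_a[pair] * prox_b[pair] for pair in prox_a)


# Two partitions of five objects, given as class labels per object.
first = [0, 0, 1, 1, 2]
second = [0, 0, 1, 2, 2]

# 0/1 scoring: counts the pairs placed together in both partitions.
joint_same = cross_product(first, second)

# +1/-1 scoring: agreements minus disagreements over all pairs.
agree_minus_disagree = cross_product(first, second, 1.0, -1.0)
```

With 0/1 scoring the cross-product counts pairs that share a class in both partitions; with +1/-1 scoring it equals the number of agreeing pairs minus the number of disagreeing pairs, which shows how different scoring rules lead to different statistics.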
The Rand index is a popular measure for comparing partitions. It is based on how object pairs are classified in a contingency table. There are four types of pairs: (i) objects in the same class in both partitions; (ii) objects in different classes in both partitions; (iii) objects in different classes in the first partition and the same class in the second; and (iv) objects in the same class in the first partition and different classes in the second. Types (i) and (ii) are agreements between the partitions, and types (iii) and (iv) are disagreements. The Rand index is the proportion of all pairs that are of type (i) or (ii). This index is then normalized to account for chance. The paper proposes a broader class of comparison measures under a common framework, including the Rand index as a special case. It also introduces a new measure based on the comparison of object triples, which has a probabilistic interpretation and is corrected for chance.
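The pair classification can be sketched in a few lines of Python. The function names are hypothetical. The Rand index here is taken as the proportion of pairs on which the two partitions agree, and the chance correction assumes a fixed-marginals (generalized hypergeometric) baseline with the general form (index - expected) / (maximum - expected). That baseline is one common choice and is an assumption here, not necessarily the exact normalization the paper argues for.

```python
from itertools import combinations


def pair_type_counts(labels_first, labels_second):
    """Count object pairs of types (i)-(iv) for two partitions."""
    if len(labels_first) != len(labels_second):
        raise ValueError("both partitions must cover the same objects")
    counts = {"i": 0, "ii": 0, "iii": 0, "iv": 0}
    for a, b in combinations(range(len(labels_first)), 2):
        same_first = labels_first[a] == labels_first[b]
        same_second = labels_second[a] == labels_second[b]
        if same_first and same_second:
            counts["i"] += 1      # together in both partitions
        elif not same_first and not same_second:
            counts["ii"] += 1     # apart in both partitions
        elif same_second:
            counts["iii"] += 1    # apart in the first, together in the second
        else:
            counts["iv"] += 1     # together in the first, apart in the second
    return counts


def rand_index(labels_first, labels_second):
    """Proportion of pairs on which the two partitions agree."""
    counts = pair_type_counts(labels_first, labels_second)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least two objects")
    return (counts["i"] + counts["ii"]) / total


def chance_corrected_index(labels_first, labels_second):
    """(index - expected) / (maximum - expected), assuming fixed marginals."""
    counts = pair_type_counts(labels_first, labels_second)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least two objects")
    together_first = counts["i"] + counts["iv"]    # pairs joined in the first
    together_second = counts["i"] + counts["iii"]  # pairs joined in the second
    expected = together_first * together_second / total
    maximum = (together_first + together_second) / 2
    if maximum == expected:
        raise ValueError("chance-corrected index is undefined for these partitions")
    return (counts["i"] - expected) / (maximum - expected)


first = [0, 0, 1, 1, 2]
second = [0, 0, 1, 2, 2]
counts = pair_type_counts(first, second)
```

For these two example partitions, 8 of the 10 pairs are agreements, so the uncorrected index is 0.8, while the corrected value is noticeably lower because many of the agreeing pairs are ones that would be kept apart by chance alone.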