Correlation analysis of binary similarity and dissimilarity measures
Numerous binary similarity measures exist, and different measures capture different aspects of the taxonomic relationship between two objects. This study presents the correlations among 76 binary similarity and dissimilarity measures used in many different fields. The measures are grouped on the basis of their synthetic properties, arithmetic relationships, and chronological order. Five binary data sets of three different binary types are used to investigate data dependency and data invariance: a hypothetical random binary data set; three hypothetical equal random binary data sets (the first with 50% 1s, the second with 90% 1s, and the third with 90% 0s); and a flattened nominal mushroom data set. This is the most extensive study of binary similarity measures conducted to date, analyzing the correlations of 2,850 pairs of measures and comparing them across the five data sets. Using agglomerative hierarchical clustering with single linkage, the 76 measures are clustered by their similarity values. In addition, the correlation between each pair of measures is quantified from the similarity values the two measures produce. A variety of correlation patterns are found: 12 patterns of data-set-dependent correlation and 3 patterns of data-set-invariant correlation. The correlation matrix representing the 2,850 pairwise correlations is transformed into a gray-scale square image for easier visualization and interpretation. Finally, statistical significance tests are performed on the correlation matrices obtained from the five data sets. The distribution curves for each data set show that the correlations among binary similarity and dissimilarity measures are affected by the data set domain.
Keywords: binary similarity measure, distance measure, correlation, hierarchical cluster analysis
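As a rough illustration of the pairwise-correlation step described in the abstract, the following is a minimal sketch and not the dissertation's implementation. It assumes the common four-cell notation (a, b, c, d) for a pair of binary vectors, uses only three illustrative measures (Jaccard, Dice, simple matching) rather than the 76 studied, and computes Pearson correlation; all function names are hypothetical. It also checks that 76 measures yield 2,850 unordered pairs.

```python
import math
import random
from itertools import combinations


def otu_counts(x, y):
    """Return (a, b, c, d): counts of 1-1, 0-1, 1-0 and 0-0 positions."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    c = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    d = len(x) - a - b - c
    return a, b, c, d


def jaccard(a, b, c, d):
    return a / (a + b + c) if (a + b + c) else 0.0


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


# Illustrative subset only; the study itself covers 76 measures.
MEASURES = {"jaccard": jaccard, "dice": dice, "simple_matching": simple_matching}


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = math.sqrt(sum((p - mu) ** 2 for p in u))
    sv = math.sqrt(sum((q - mv) ** 2 for q in v))
    return cov / (su * sv) if su and sv else 0.0


def random_binary_vectors(n, length, p_one, seed=0):
    """Generate n random binary vectors with probability p_one of a 1."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_one else 0 for _ in range(length)]
            for _ in range(n)]


def measure_correlations(vectors):
    """Correlate every pair of measures over all pairs of input vectors."""
    pairs = list(combinations(vectors, 2))
    values = {name: [f(*otu_counts(x, y)) for x, y in pairs]
              for name, f in MEASURES.items()}
    return {(m, n): pearson(values[m], values[n])
            for m, n in combinations(MEASURES, 2)}


def count_measure_pairs(n_measures):
    """Number of unordered pairs among n_measures measures."""
    return n_measures * (n_measures - 1) // 2


if __name__ == "__main__":
    vecs = random_binary_vectors(20, 32, 0.5)
    for pair, r in measure_correlations(vecs).items():
        print(pair, round(r, 3))
    print(count_measure_pairs(76))
```

Changing `p_one` (for example to 0.9 or 0.1) gives a simple way to see how the correlations between measures shift with the proportion of 1s in the data, which is the kind of data dependency the abstract refers to.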
Seung-Seok (Seung) Choi,
"Correlation analysis of binary similarity and dissimilarity measures"
(January 1, 2008).
ETD Collection for Pace University.