Rand index

From Wikipedia, the free encyclopedia
Example clusterings for a dataset with the k-means (left) and mean shift (right) algorithms. The adjusted Rand index was calculated for these two clusterings.

The Rand index[1] or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements; this is the adjusted Rand index. The Rand index is the accuracy of determining whether a link (a pair of elements) belongs within a cluster or not.

Rand index

Definition

Given a set of $n$ elements $S = \{o_1, \ldots, o_n\}$ and two partitions of $S$ to compare, $X = \{X_1, \ldots, X_r\}$, a partition of $S$ into $r$ subsets, and $Y = \{Y_1, \ldots, Y_s\}$, a partition of $S$ into $s$ subsets, define the following:

  • $a$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in the same subset in $Y$
  • $b$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in different subsets in $Y$
  • $c$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in different subsets in $Y$
  • $d$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in the same subset in $Y$

The Rand index, $R$, is:[1][2]

$$R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}$$

Intuitively, $a + b$ can be considered as the number of agreements between $X$ and $Y$, and $c + d$ as the number of disagreements between $X$ and $Y$.

Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that $X$ and $Y$ will agree on a randomly chosen pair.

$\binom{n}{2}$ is calculated as $n(n-1)/2$.
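
The four pair counts and the index built from them can be sketched by brute-force enumeration of all pairs. This is a minimal illustration, not a reference implementation: the function names `pair_counts` and `rand_index` are hypothetical, and each clustering is assumed to be given as a sequence of cluster labels, one per element.

```python
from itertools import combinations


def pair_counts(labels_x, labels_y):
    """Return (a, b, c, d) for two clusterings given as label sequences."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both clusterings must label the same elements")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1  # together in X and together in Y
        elif not same_x and not same_y:
            b += 1  # apart in X and apart in Y
        elif same_x:
            c += 1  # together in X, apart in Y
        else:
            d += 1  # apart in X, together in Y
    return a, b, c, d


def rand_index(labels_x, labels_y):
    """Fraction of pairs on which the two clusterings agree."""
    a, b, c, d = pair_counts(labels_x, labels_y)
    total = a + b + c + d  # equals n * (n - 1) / 2
    if total == 0:
        raise ValueError("need at least two elements")
    return (a + b) / total


counts = pair_counts([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
index = rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

The loop visits every pair, so the cost grows quadratically with the number of elements; that is acceptable for illustration but not for large datasets.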

Similarly, one can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.
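
As a small worked example of this formula, take counts chosen purely for illustration (they are not from the cited sources): 2 true positives, 8 true negatives, 4 false positives and 1 false negative, which together account for the 15 pairs of a six-element set.

```latex
\[
RI = \frac{TP + TN}{TP + FP + FN + TN}
   = \frac{2 + 8}{2 + 4 + 1 + 8}
   = \frac{10}{15}
   \approx 0.667
\]
```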

Properties

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.
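
Both ends of this range can be reached with tiny inputs. The sketch below assumes clusterings given as label sequences and uses a hypothetical helper `rand_index`. It shows that renaming the clusters does not change the index, and that a single all-inclusive cluster compared with all singletons is one case in which the two clusterings disagree on every pair.

```python
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Fraction of element pairs on which both clusterings agree."""
    pairs = list(combinations(range(len(labels_x)), 2))
    if not pairs:
        raise ValueError("need at least two elements")
    agreements = sum(
        (labels_x[i] == labels_x[j]) == (labels_y[i] == labels_y[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Same grouping, different cluster names: every pair agrees.
same_up_to_renaming = rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])

# One big cluster versus all singletons: every pair disagrees.
one_cluster_vs_singletons = rand_index([0, 0, 0, 0], [0, 1, 2, 3])
```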

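In mathematical terms, $a$, $b$, $c$, $d$ are defined as follows:

  • $a = |S^{*}|$, where $S^{*} = \{(o_i, o_j) \mid o_i, o_j \in X_k,\ o_i, o_j \in Y_l\}$
  • $b = |S^{*}|$, where $S^{*} = \{(o_i, o_j) \mid o_i \in X_{k_1},\ o_j \in X_{k_2},\ o_i \in Y_{l_1},\ o_j \in Y_{l_2}\}$
  • $c = |S^{*}|$, where $S^{*} = \{(o_i, o_j) \mid o_i, o_j \in X_k,\ o_i \in Y_{l_1},\ o_j \in Y_{l_2}\}$
  • $d = |S^{*}|$, where $S^{*} = \{(o_i, o_j) \mid o_i \in X_{k_1},\ o_j \in X_{k_2},\ o_i, o_j \in Y_l\}$

for some $1 \le i, j \le n$, $i \neq j$, $1 \le k, k_1, k_2 \le r$, $k_1 \neq k_2$, $1 \le l, l_1, l_2 \le s$, $l_1 \neq l_2$.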

Relationship with classification accuracy

The Rand index can also be viewed through the prism of binary classification accuracy over the pairs of elements in $S$. The two class labels are "$o_i$ and $o_j$ are in the same subset in $X$ and $Y$" and "$o_i$ and $o_j$ are in different subsets in $X$ and $Y$".

In that setting, $a$ is the number of pairs correctly labeled as belonging to the same subset (true positives), and $b$ is the number of pairs correctly labeled as belonging to different subsets (true negatives).
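
The classification view can be sketched by treating every pair as one binary decision. The code below is illustrative only: it assumes one clustering plays the role of the reference and the other the prediction (an assumption made here to give "false positive" and "false negative" a meaning), and the name `pairwise_confusion` is hypothetical.

```python
from itertools import combinations


def pairwise_confusion(reference, predicted):
    """Count TP, TN, FP, FN over all unordered pairs of elements.

    A "positive" is a pair placed in the same cluster.
    """
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(reference)), 2):
        actual = reference[i] == reference[j]
        guess = predicted[i] == predicted[j]
        if guess and actual:
            tp += 1
        elif not guess and not actual:
            tn += 1
        elif guess:
            fp += 1  # grouped together, but apart in the reference
        else:
            fn += 1  # split apart, but together in the reference
    return tp, tn, fp, fn


tp, tn, fp, fn = pairwise_confusion([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1])
accuracy = (tp + tn) / (tp + tn + fp + fn)  # this is the Rand index
```

Swapping the two arguments exchanges the false positive and false negative counts but leaves the accuracy unchanged, which reflects the symmetry of the Rand index.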

Adjusted Rand index

The adjusted Rand index is the corrected-for-chance version of the Rand index.[1][2][3] Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model. Traditionally, the Rand index was corrected using the permutation model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters varies drastically. For example, in k-means the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand index account for different models of random clusterings.[4]
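
The chance baseline under the permutation model can be approximated by simulation. The sketch below is a Monte Carlo illustration rather than the analytic correction: it shuffles one label sequence, which keeps the number and sizes of clusters fixed while reassigning their members, and averages the resulting Rand index. The function names, the trial count and the seed are all assumptions made for this example.

```python
import random
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Fraction of element pairs on which both clusterings agree."""
    pairs = list(combinations(range(len(labels_x)), 2))
    if not pairs:
        raise ValueError("need at least two elements")
    agreements = sum(
        (labels_x[i] == labels_x[j]) == (labels_y[i] == labels_y[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def estimate_expected_rand_index(labels_x, labels_y, trials=2000, seed=0):
    """Average Rand index over random shuffles of labels_y."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    shuffled = list(labels_y)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)  # cluster sizes stay fixed, members change
        total += rand_index(labels_x, shuffled)
    return total / trials


baseline = estimate_expected_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

For these labels the analytic expectation under the permutation model is 0.56, so unrelated clusterings with these cluster sizes already score well above 0 on the unadjusted index.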

Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.[5]

The contingency table

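Given a set $S$ of $n$ elements, and two groupings or partitions (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j$: $n_{ij} = |X_i \cap Y_j|$.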
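
A contingency table of this kind can be built by counting label pairs. The following is a minimal sketch assuming clusterings given as label sequences; the function name `contingency_table` and the first-appearance ordering of rows and columns are choices made for this example.

```python
from collections import Counter


def contingency_table(labels_x, labels_y):
    """Return (row_labels, col_labels, table) with table[i][j] = |X_i ∩ Y_j|."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both clusterings must label the same elements")
    rows = list(dict.fromkeys(labels_x))  # distinct labels, first-seen order
    cols = list(dict.fromkeys(labels_y))
    counts = Counter(zip(labels_x, labels_y))  # missing keys count as 0
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


rows, cols, table = contingency_table([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
row_sums = [sum(row) for row in table]        # sizes of the clusters in X
col_sums = [sum(col) for col in zip(*table)]  # sizes of the clusters in Y
```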

Definition

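The original adjusted Rand index using the permutation model is

$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}$$

where $n_{ij}$, $a_i$, $b_j$ are values from the contingency table: $n_{ij}$ is an entry, $a_i = \sum_j n_{ij}$ is a row sum, and $b_j = \sum_i n_{ij}$ is a column sum.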
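
The formula can be evaluated directly from the contingency counts. The sketch below is one possible transcription, assuming clusterings given as label sequences. The function name is hypothetical, and returning 1.0 when the denominator is zero (which happens only when both clusterings are a single cluster or both are all singletons) is a convention adopted here, not something the formula itself specifies.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index under the permutation model."""
    n = len(labels_x)
    if n != len(labels_y):
        raise ValueError("both clusterings must label the same elements")
    if n < 2:
        raise ValueError("need at least two elements")
    n_ij = Counter(zip(labels_x, labels_y))  # contingency table entries
    a_i = Counter(labels_x)                  # row sums
    b_j = Counter(labels_y)                  # column sums
    sum_nij = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a_i.values())
    sum_b = sum(comb(v, 2) for v in b_j.values())
    expected = sum_a * sum_b / comb(n, 2)    # chance baseline
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # assumed convention for the degenerate case
    return (sum_nij - expected) / (maximum - expected)


partial = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
identical = adjusted_rand_index([0, 0, 1, 1], [5, 5, 7, 7])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In this example `crossed` is negative, because the two clusterings agree on fewer pairs than the permutation model predicts by chance.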

References

  1. W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association. 66 (336). American Statistical Association: 846–850. doi:10.2307/2284239. JSTOR 2284239.
  2. Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification. 2 (1): 193–218. doi:10.1007/BF01908075.
  3. Nguyen Xuan Vinh, Julien Epps and James Bailey (2009). "Information Theoretic Measures for Clustering Comparison: Is a Correction for Chance Necessary?" (PDF). ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning. ACM. pp. 1073–1080.
  4. Alexander J Gates and Yong-Yeol Ahn (2017). "The Impact of Random Models on Clustering Similarity" (PDF). Journal of Machine Learning Research. 18: 1–28.
  5. "Comparing Clusterings - An Overview" (PDF).