
Clustering performance evaluation

This post introduces evaluation metrics for clustering tasks.

Rand index

Suppose we have a set

\[S=\{ s_1, s_2, \cdots, s_n \}\]

and two clusterings (partitions) of it:

\[X=\{ X_1,X_2, \cdots, X_p\}, Y=\{ Y_1,Y_2, \cdots, Y_q\}\]

We first define the following four quantities over all unordered pairs $(s_i, s_j)$:

- $a$: the number of pairs in the same cluster in $X$ and in the same cluster in $Y$
- $b$: the number of pairs in different clusters in $X$ and in different clusters in $Y$
- $c$: the number of pairs in the same cluster in $X$ but in different clusters in $Y$
- $d$: the number of pairs in different clusters in $X$ but in the same cluster in $Y$

With these we can define the Rand index (RI):

\[RI := \frac{a+b}{a+b+c+d} = \frac{a+b}{C^n_2}\]

We can also write $a, b, c, d$ in set notation (implicitly requiring $k_1 \neq k_2$ and $l_1 \neq l_2$):

\[a = \#\{(s_i,s_j) \mid s_i, s_j \in X_k,\ s_i, s_j \in Y_l \}\]

\[b = \#\{(s_i,s_j) \mid s_i \in X_{k_1}, s_j \in X_{k_2},\ s_i \in Y_{l_1}, s_j \in Y_{l_2} \}\]

\[c = \#\{(s_i,s_j) \mid s_i, s_j \in X_k,\ s_i \in Y_{l_1}, s_j \in Y_{l_2} \}\]

\[d = \#\{(s_i,s_j) \mid s_i \in X_{k_1}, s_j \in X_{k_2},\ s_i, s_j \in Y_l \}\]

We can also view this through the confusion-matrix concepts introduced earlier: treating each pair of points as one sample and taking $X$ as the ground truth, $a$ counts true positives (TP), $b$ true negatives (TN), $c$ false negatives (FN), and $d$ false positives (FP). The RI can then be read as

\[RI := \frac{TP+TN}{TP+TN+FN+FP}\]
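
To make the pair counting concrete, here is a minimal brute-force sketch (not how scikit-learn implements it internally), checked against `rand_score`:

```python
from itertools import combinations

from sklearn.metrics import rand_score

def rand_index(labels_x, labels_y):
    # Count, over all unordered pairs (i, j), how often the two
    # clusterings agree or disagree on putting the pair together.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:            # together in both (TP)
            a += 1
        elif not same_x and not same_y:  # apart in both (TN)
            b += 1
        elif same_x:                     # together in X, apart in Y (FN)
            c += 1
        else:                            # apart in X, together in Y (FP)
            d += 1
    return (a + b) / (a + b + c + d)

labels_true = [0, 0, 0, 1, 1]
labels_pred = [0, 0, 1, 1, 1]
print(rand_index(labels_true, labels_pred))  # 0.6
print(rand_score(labels_true, labels_pred))  # same value
```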

Next up is the Adjusted Rand Index (ARI), which can be understood as a chance-corrected version of the Rand index. Why introduce the ARI instead of just using the RI? Because the RI of a random clustering does not come out close to 0, so Hubert and Arabie proposed a correction in 1985. We will verify the behaviour of both below.

To state its formula we first need some notation. Let $n_{ij} := \#(X_i \cap Y_j)$, with row sums $a_i = \sum_j n_{ij}$ and column sums $b_j = \sum_i n_{ij}$:

| X\Y | $Y_1$ | $Y_2$ | $\cdots$ | $Y_q$ | sum |
|-----|-------|-------|----------|-------|-----|
| $X_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1q}$ | $a_1$ |
| $X_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2q}$ | $a_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $X_p$ | $n_{p1}$ | $n_{p2}$ | $\cdots$ | $n_{pq}$ | $a_p$ |
| sum | $b_1$ | $b_2$ | $\cdots$ | $b_q$ | $n$ |
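
In scikit-learn this contingency table can be built directly with `contingency_matrix` (a quick sketch; the labels below match the worked example further down):

```python
from sklearn.metrics.cluster import contingency_matrix

labels_true = [0, 0, 0, 1, 1]  # X: clusters {s1, s2, s3} and {s4, s5}
labels_pred = [0, 0, 1, 1, 1]  # Y: clusters {s1, s2} and {s3, s4, s5}

# Entry (i, j) is n_ij = #(X_i ∩ Y_j); row sums are a_i, column sums b_j.
print(contingency_matrix(labels_true, labels_pred))
# [[2 1]
#  [0 2]]
```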

We can then define the ARI as

\[ARI := \frac{RI -E[RI]}{\max(RI)-E[RI]} = \frac{\sum_{ij}C^{n_{ij}}_2 - [t_1 \cdot t_2]/C^n_2}{\frac{t_1 + t_2}{2} - [t_1 \cdot t_2]/C^n_2}\]

where $t_1 = \sum_i C^{a_i}_2$ and $t_2 = \sum_j C^{b_j}_2$.
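
Here is a minimal sketch of this formula built on the contingency table, assuming scipy is available for the binomial coefficients (again, not scikit-learn's actual implementation):

```python
from scipy.special import comb
from sklearn.metrics.cluster import contingency_matrix

def ari(labels_x, labels_y):
    c = contingency_matrix(labels_x, labels_y)
    n = c.sum()
    sum_ij = comb(c, 2).sum()           # sum over all cells of C(n_ij, 2)
    t1 = comb(c.sum(axis=1), 2).sum()   # t1 = sum of C(a_i, 2)
    t2 = comb(c.sum(axis=0), 2).sum()   # t2 = sum of C(b_j, 2)
    expected = t1 * t2 / comb(n, 2)     # the E[RI] term
    return (sum_ij - expected) / ((t1 + t2) / 2 - expected)

print(ari([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))  # 1/6, as computed by hand below
```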

Let's work through a concrete example by hand. Suppose $S=\{1,2,3,4,5\}$ with

\[X=\{ X_1=\{1,2,3\}, X_2=\{4,5\} \}\]

\[Y=\{ Y_1=\{1,2\}, Y_2=\{3,4,5\} \}\]

The resulting table looks like this:

| X\Y | $Y_1$ | $Y_2$ | sum |
|-----|-------|-------|-----|
| $X_1$ | $\{1,2\}$ | $\{3\}$ | $3$ |
| $X_2$ | $\emptyset$ | $\{4,5\}$ | $2$ |
| sum | $2$ | $3$ | $5$ |

Plugging in, $t_1 = t_2 = C^3_2 + C^2_2 = 4$, $C^5_2 = 10$, and $\sum_{ij} C^{n_{ij}}_2 = C^2_2 + C^1_2 + C^0_2 + C^2_2 = 2$. Substituting, we get

\[ARI = \frac{2-1.6}{4-1.6} = \frac{1}{6} \approx 0.1667\]

Let's check against scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1]
labels_pred = [0, 0, 1, 1, 1]

# Same clusterings as the worked example: should give 1/6.
display(adjusted_rand_score(labels_true, labels_pred))
```

Now, does a random clustering really score close to 0? Let's test it.

```python
import random
from sklearn.metrics import rand_score, adjusted_rand_score

n = 5

labels_true = [0] * n
labels_pred = [0] * n

# Randomly assign a suffix of a shuffled index list to cluster 1,
# giving a random two-cluster labelling.
indices = list(range(n))
random.shuffle(indices)
break_index = random.randint(0, n)
for index in indices[break_index:]:
    labels_true[index] = 1

indices = list(range(n))
random.shuffle(indices)
break_index = random.randint(0, n)
for index in indices[break_index:]:
    labels_pred[index] = 1

print('True: ', labels_true)
print('Pred: ', labels_pred)
print('Rand score: ', rand_score(labels_true, labels_pred))
print('Adjusted Rand score: ', adjusted_rand_score(labels_true, labels_pred))
```

A single draw is noisy, so let's average both scores over many random clusterings:

```python
import random
from sklearn.metrics import rand_score, adjusted_rand_score

n = 20
test_time = 10000
accumulation_RI = 0
accumulation_ARI = 0

def random_split(n=5):
    # A random binary clustering of {0, ..., n-1}, as above.
    indices = list(range(n))
    random.shuffle(indices)
    break_index = random.randint(0, n)
    return [1 if i in indices[break_index:] else 0 for i in range(n)]

for i in range(test_time):
    labels_true = random_split(n)
    labels_pred = random_split(n)

    accumulation_RI += rand_score(labels_true, labels_pred)
    accumulation_ARI += adjusted_rand_score(labels_true, labels_pred)

accumulation_RI /= test_time
accumulation_ARI /= test_time

# The RI stays well above 0, while the ARI averages out near 0.
print('Average Rand score: ', accumulation_RI)
print('Average Adjusted Rand score: ', accumulation_ARI)
```

Here is scikit-learn's rand_score in practice. As the definition above suggests, the score does not depend on how the cluster labels are named:

```python
from sklearn.metrics import rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

display(rand_score(labels_true, labels_pred))

# Same partition with the clusters renamed: the score is unchanged.
labels_pred = [1, 1, 0, 0, 3, 3]

display(rand_score(labels_true, labels_pred))
```

The same holds for adjusted_rand_score:

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

display(adjusted_rand_score(labels_true, labels_pred))

labels_pred = [1, 1, 0, 0, 3, 3]

display(adjusted_rand_score(labels_true, labels_pred))
```

Mutual information

Now for another family of metrics. We keep the definitions from above: the full set is $S$, the two clusterings are $X$ and $Y$, and $n_{ij} := \#(X_i \cap Y_j)$:

| X\Y | $Y_1$ | $Y_2$ | $\cdots$ | $Y_q$ | sum |
|-----|-------|-------|----------|-------|-----|
| $X_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1q}$ | $a_1$ |
| $X_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2q}$ | $a_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $X_p$ | $n_{p1}$ | $n_{p2}$ | $\cdots$ | $n_{pq}$ | $a_p$ |
| sum | $b_1$ | $b_2$ | $\cdots$ | $b_q$ | $n$ |

Next we need the concept of entropy:

\[H(X) = - \sum_{i=1}^p \frac{\# X_i}{n} \log\left( \frac{\# X_i}{n} \right)\]

\[H(Y) = - \sum_{j=1}^q \frac{\# Y_j}{n} \log\left( \frac{\# Y_j}{n} \right)\]
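
As a quick self-contained sketch (using the natural log, which is also what scikit-learn uses for these scores):

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum_k (count_k / n) * log(count_k / n)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

print(entropy([0, 0, 0, 1, 1, 1]))  # log(2) ≈ 0.693 for a 50/50 split
```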

Then we can define the mutual information (MI):

\[MI(X,Y) = \sum_{i=1}^p \sum_{j=1}^q \frac{\#(X_i \cap Y_j)}{n} \log\left( \frac{n \, \#(X_i \cap Y_j)}{\# X_i \cdot \# Y_j} \right)\]
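
As a sketch, the double sum can be computed straight from the contingency table and checked against mutual_info_score:

```python
import math

from sklearn.metrics import mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def mi(labels_x, labels_y):
    c = contingency_matrix(labels_x, labels_y)
    n = c.sum()
    total = 0.0
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            n_ij = c[i, j]
            if n_ij == 0:  # skip empty cells: 0 * log(0) is taken as 0
                continue
            total += (n_ij / n) * math.log(n * n_ij / (c[i].sum() * c[:, j].sum()))
    return total

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(mi(labels_true, labels_pred))
print(mutual_info_score(labels_true, labels_pred))  # same value
```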

Next we define the normalized mutual information (NMI):

\[NMI(X,Y) = \frac{MI(X,Y)}{\operatorname{mean}(H(X), H(Y))}\]
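
The "mean" here can be the arithmetic, geometric, min, or max of the two entropies; scikit-learn exposes this choice as the average_method parameter of normalized_mutual_info_score (default 'arithmetic'):

```python
from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# average_method selects which mean is used in the denominator.
for method in ['arithmetic', 'geometric', 'min', 'max']:
    print(method, normalized_mutual_info_score(labels_true, labels_pred,
                                               average_method=method))
```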

If you have been following along, you may already spot the problem: just like the plain RI, the MI and NMI of two random clusterings do not concentrate near 0. This is why the adjusted mutual information (AMI) was proposed:

\[AMI(X,Y) = \frac{MI(X,Y) - E[MI]}{\operatorname{mean}(H(X), H(Y)) - E[MI]}\]

Let's compute all three with scikit-learn:

```python
from sklearn.metrics import (mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print('MI: ', mutual_info_score(labels_true, labels_pred))
print('NMI: ', normalized_mutual_info_score(labels_true, labels_pred))
print('AMI: ', adjusted_mutual_info_score(labels_true, labels_pred))
```
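
Mirroring the RI/ARI experiment above, here is a quick sketch averaging the three scores over random clusterings, reusing the same random_split helper. MI and NMI should stay clearly above 0, while AMI should come out near 0:

```python
import random
from sklearn.metrics import (mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

def random_split(n):
    # Same helper as in the RI/ARI experiment: a random binary clustering.
    indices = list(range(n))
    random.shuffle(indices)
    break_index = random.randint(0, n)
    return [1 if i in indices[break_index:] else 0 for i in range(n)]

n = 20
test_time = 10000
acc_mi = acc_nmi = acc_ami = 0.0

for _ in range(test_time):
    labels_true = random_split(n)
    labels_pred = random_split(n)
    acc_mi += mutual_info_score(labels_true, labels_pred)
    acc_nmi += normalized_mutual_info_score(labels_true, labels_pred)
    acc_ami += adjusted_mutual_info_score(labels_true, labels_pred)

print('Average MI: ', acc_mi / test_time)    # noticeably above 0
print('Average NMI: ', acc_nmi / test_time)  # noticeably above 0
print('Average AMI: ', acc_ami / test_time)  # close to 0
```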

More metrics can be found in the scikit-learn user guide on clustering performance evaluation.