Probability Calibration
Here is a question: given a binary classification problem, say with a positive and a negative class, and a model that outputs a prediction of $0.8$ for some sample, would you conclude that the sample has an $80\%$ chance of being positive?

The answer: unless the model has been calibrated, that conclusion is wrong. A model's scores can rank samples correctly without matching actual frequencies; among all samples scored around $0.8$, the true fraction of positives can be far from $80\%$.
Motivation
If you are only doing something like image classification, you probably just care about which decision threshold to pick. But if you are doing risk analysis, or handing out a prediction report, you will want the model to output a genuine probability.
Calibration Curve
After training a model, some models can output probability predictions via predict_proba, but some cannot and only provide a decision_function. In that case you can min-max scale the decision scores into the $[0, 1]$ interval. Then plot the mean predicted value on the x axis and the actual fraction of positives on the y axis; the distance of this curve from the diagonal tells you how large the calibration error is.

Below is a usage example of calibration_curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import calibration_curve

np.random.seed(5087)

X, y = datasets.make_classification(n_samples=100000,
                                    n_features=20,
                                    n_informative=2,
                                    n_redundant=2)

train_samples = 100  # Samples used for training the models
X_train, X_test = X[:train_samples], X[train_samples:]
y_train, y_test = y[:train_samples], y[train_samples:]

# Create classifiers
classifier_lr = LogisticRegression().fit(X_train, y_train)
classifier_gnb = GaussianNB().fit(X_train, y_train)
classifier_svc = LinearSVC(C=1.0).fit(X_train, y_train)
classifier_rfc = RandomForestClassifier().fit(X_train, y_train)

# Plot calibration plots
plt.figure(figsize=(8, 8))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)  # reliability curves
ax2 = plt.subplot2grid((3, 1), (2, 0))             # histogram of predicted probabilities

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for classifier, name in [(classifier_lr, 'Logistic'),
                         (classifier_gnb, 'Naive Bayes'),
                         (classifier_svc, 'Support Vector Classification'),
                         (classifier_rfc, 'Random Forest')]:
    if hasattr(classifier, "predict_proba"):
        prob_pos = classifier.predict_proba(X_test)[:, 1]
    else:
        # LinearSVC has no predict_proba; min-max scale its decision scores to [0, 1]
        decision_score = classifier.decision_function(X_test)
        prob_pos = (decision_score - decision_score.min()) / (decision_score.max() - decision_score.min())
    fraction_of_positives, mean_predicted_value = calibration_curve(y_test, prob_pos, n_bins=10)
    ax1.plot(mean_predicted_value, fraction_of_positives, "s-", label=name)
    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
             histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()
plt.show()
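The curve above only diagnoses miscalibration; it does not fix it. One common fix is scikit-learn's CalibratedClassifierCV, which wraps a classifier and learns a mapping from its scores to calibrated probabilities, either Platt scaling (method='sigmoid') or isotonic regression (method='isotonic'). A minimal sketch, reusing X_train, y_train, and X_test from the example above:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Wrap the raw classifier; with cv=3, each fold fits a sigmoid (logistic)
# mapping from decision scores to probabilities on its held-out part.
calibrated_svc = CalibratedClassifierCV(LinearSVC(C=1.0), method='sigmoid', cv=3)
calibrated_svc.fit(X_train, y_train)

# The wrapper exposes predict_proba even though LinearSVC itself does not.
prob_pos_calibrated = calibrated_svc.predict_proba(X_test)[:, 1]

Feeding prob_pos_calibrated back into calibration_curve should produce a curve that sits closer to the diagonal.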
If it is still not clear what calibration_curve actually computes, let's walk through its internals with an example.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

np.random.seed(5087)

train_samples = 100
X, y = datasets.make_classification(n_samples=(train_samples * 2),
                                    n_features=20,
                                    n_informative=2,
                                    n_redundant=2)
X_train, X_test = X[:train_samples], X[train_samples:]
y_train, y_test = y[:train_samples], y[train_samples:]

prob_pos = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

bins = np.linspace(0.0, 1.0, 10 + 1)
print('bins: ', bins)  # the bin edges: how many pieces [0, 1] is split into

binids = np.searchsorted(bins[1:-1], prob_pos)
print('binids: ', binids)  # assign each predicted probability to its bin

bin_sums = np.bincount(binids, weights=prob_pos, minlength=len(bins))
print('bin_sums: ', bin_sums)  # sum of predicted probabilities in each bin

bin_true = np.bincount(binids, weights=y_test, minlength=len(bins))
print('bin_true: ', bin_true)  # number of true positives in each bin

bin_total = np.bincount(binids, minlength=len(bins))  # number of samples in each bin

nonzero = bin_total != 0  # drop empty bins to avoid dividing by zero
prob_true = bin_true[nonzero] / bin_total[nonzero]
print('prob_true: ', prob_true)  # fraction of true positives in each bin (the y axis)

prob_pred = bin_sums[nonzero] / bin_total[nonzero]
print('prob_pred: ', prob_pred)  # mean predicted probability in each bin (the x axis)
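As a sanity check, this hand-rolled binning mirrors what calibration_curve does with its default uniform binning, so the two results should agree:

from sklearn.calibration import calibration_curve

# Compare the manual computation against scikit-learn on the same predictions.
prob_true_sk, prob_pred_sk = calibration_curve(y_test, prob_pos, n_bins=10)
print(np.allclose(prob_true, prob_true_sk))  # expected: True
print(np.allclose(prob_pred, prob_pred_sk))  # expected: True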