Machine Learning with sklearn: Ensemble Learning (Part 2)

Overview of GBDT

GBDT is very popular in practical applications. Like Adaboost it is built on the forward stagewise algorithm, but unlike Adaboost (which uses the previous round's weak-learner error rate to reweight the training samples), GBDT fits each new round to the remaining error of the current model, and its weak learners are restricted to CART regression trees.

An intuitive way to understand GBDT: suppose the price of a house is 1,000,000. We first fit it with 800,000 and find we are 200,000 short; we then fit that gap with 150,000, leaving 50,000; next we fit with 40,000, leaving 10,000. Iterating like this, the residual error keeps shrinking until it falls below a preset threshold.
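To make the analogy concrete, here is a minimal hand-rolled sketch of that residual-fitting loop (illustrative synthetic data and names; squared-error loss, where each round's fitting target is exactly the residual):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(0, 0.1, 200)

learning_rate = 0.5
prediction = np.zeros_like(y_demo)        # start from a constant zero model
for _ in range(50):
    residual = y_demo - prediction        # what the current ensemble still misses
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residual)            # fit a small CART tree to the residual
    prediction += learning_rate * tree.predict(X_demo)

print("final training MSE:", np.mean((y_demo - prediction) ** 2))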

To make this fitting step work for a general loss function, Friedman proposed fitting each round with the negative gradient of the loss, used as an approximation of the current round's residual, and then fitting a CART regression tree to it. Since typesetting the full derivation takes time, I recommend a blog post for the details.
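For reference, the key formula is short: in round t, the fitting target for sample i is

r_{ti} = -\left[ \frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)} \right]_{f(x) = f_{t-1}(x)}

and a CART regression tree is fit to the pairs (x_i, r_{ti}). For the squared-error loss L(y, f(x)) = \frac{1}{2}\big(y - f(x)\big)^2, this negative gradient is exactly the residual y_i - f_{t-1}(x_i), which is the house-price intuition above.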

GBDT has many advantages; in summary:
1. Since CART trees serve as the weak learners, it can handle all kinds of data, both continuous and discrete features.
2. Its predictive accuracy is relatively high.
3. Robust loss functions (such as the Huber loss) can be used, which makes it very tolerant of outliers (see the sketch after this list).
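As a small illustration of point 3, a hedged sketch using the Huber loss in GradientBoostingRegressor (synthetic data; alpha is the quantile beyond which points are treated as outliers):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(42)
X_r = rng.uniform(0, 10, size=(300, 1))
y_r = 2.0 * X_r.ravel() + rng.normal(0, 0.5, 300)
y_r[::30] += 50                        # inject gross outliers

gbr = GradientBoostingRegressor(loss="huber", alpha=0.9, random_state=42)
gbr.fit(X_r, y_r)
print(gbr.predict([[5.0]]))            # stays close to 10 despite the outliers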

Below is a small demo of GBDT in practice.

Classification with a CART decision tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
%matplotlib inline

data = load_iris()
X = data.data
y = data.target

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

fig = plt.figure()
# sepal width (column 1) against petal length (column 2), one color per class
plt.scatter(x_train[y_train==0][:, 1], x_train[y_train==0][:, 2])
plt.scatter(x_train[y_train==1][:, 1], x_train[y_train==1][:, 2])
plt.scatter(x_train[y_train==2][:, 1], x_train[y_train==2][:, 2])
plt.legend(data.target_names)
plt.show()

[Figure: scatter plot of the training set, sepal width vs. petal length, colored by iris class]

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
import pydotplus
from sklearn import metrics

dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, dtc.predict(x_test), target_names=data.target_names))
dot_data = export_graphviz(decision_tree=dtc, 
                           out_file=None, 
                           feature_names=data.feature_names, 
                           class_names=data.target_names, 
                           filled=True, 
                           rounded=True, 
                           special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Accuracy (train): 1
Accuracy (test): 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

[Figure: the fitted decision tree rendered via graphviz]
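Note that pydotplus additionally requires a local Graphviz installation. On sklearn 0.21 and later the same tree can be drawn with matplotlib alone; a minimal alternative sketch:

from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(dtc, feature_names=data.feature_names,
          class_names=list(data.target_names), filled=True, rounded=True)
plt.show()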

Ensemble learning with GBDT

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)

print("Accuracy(train) : %.4g" % gbc.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gbc.score(x_test, y_test))
print("混淆矩阵:\n", metrics.classification_report(y_test, gbc.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

Parameter tuning: here we only cross-validate the tree depth. Many more parameters can be tuned; we will not go through them one by one.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

params = {"max_depth": list(range(1,11))}
gbc = GradientBoostingClassifier()
gs = GridSearchCV(gbc, param_grid=params, cv=10)
gs.fit(x_train, y_train)


print("Accuracy(train) : %.4g" % gs.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gs.score(x_test, y_test))
print("混淆矩阵:\n", metrics.classification_report(y_test, gs.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

The cross-validation results

gs.best_estimator_, gs.best_score_, gs.best_params_, gs.grid_scores_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)

(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.1, loss='deviance', max_depth=4,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=100,
               presort='auto', random_state=None, subsample=1.0, verbose=0,
               warm_start=False),
 0.9642857142857143,
 {'max_depth': 4},
 [mean: 0.94643, std: 0.06325, params: {'max_depth': 1},
  mean: 0.94643, std: 0.07307, params: {'max_depth': 2},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 3},
  mean: 0.96429, std: 0.04335, params: {'max_depth': 4},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 5},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 6},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 7},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 8},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 9},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 10}])
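As the DeprecationWarning says, grid_scores_ was removed in sklearn 0.20. On newer versions the same information is read from cv_results_; a minimal equivalent using its standard keys:

import pandas as pd

results = pd.DataFrame(gs.cv_results_)
print(results[["param_max_depth", "mean_test_score", "std_test_score"]])
print(gs.best_params_, gs.best_score_)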

Classification with the AdaBoost ensemble

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(x_train, y_train)

print("Accuracy(train) : %.4g" % ada.score(x_train, y_train))
print("Accuracy(test) : %.4g" % ada.score(x_test, y_test))
print("混淆矩阵:\n", metrics.classification_report(y_test, ada.predict(x_test), target_names=data.target_names))
Accuracy(train) : 0.9821
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

Classification with XGBoost

import xgboost
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,                # number of classes, used together with multi:softmax
    'gamma': 0.1,                  # minimum loss reduction required for a further split (pruning control); larger is more conservative, typically around 0.1-0.2
    'max_depth': 3,                # depth of each tree; deeper trees overfit more easily
    'lambda': 1,                   # L2 regularization on leaf weights; larger values make the model less prone to overfitting
    'subsample': 0.8,              # random row subsampling of the training data
    'colsample_bytree': 0.7,       # column subsampling when building each tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 suppresses run-time messages; set to 0 to see them
    'eta': 0.01,                   # shrinkage, analogous to a learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}

dtrain = xgboost.DMatrix(x_train, y_train)
num_rounds = 500
model = xgboost.train(params=params, dtrain=dtrain, num_boost_round=num_rounds)
dtest = xgboost.DMatrix(x_test)
ans = model.predict(dtest)
cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1
print("Accuracy(test): \n", cnt1/(cnt1 + cnt2))
Accuracy(test): 
 0.9473684210526315
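The counting loop above can be replaced with the metrics helper imported earlier, and xgboost also ships a sklearn-compatible wrapper whose hyperparameters largely mirror the params dict above (a hedged sketch; note that very old xgboost versions name the random seed seed rather than random_state):

# same accuracy without the manual loop
print("Accuracy(test):", metrics.accuracy_score(y_test, ans))

# optional: the sklearn-style wrapper fits the usual fit/score workflow
from xgboost import XGBClassifier
xgb = XGBClassifier(max_depth=3, learning_rate=0.01, n_estimators=500,
                    subsample=0.8, colsample_bytree=0.7)
xgb.fit(x_train, y_train)
print("Accuracy(test):", xgb.score(x_test, y_test))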

A more complex example: with a larger dataset, the advantage of ensemble learning shows more clearly.

import pandas as pd
train = pd.read_csv("./train_modified.csv")
# train.head(10)
train.describe()
# train[train.isnull().values==True]
          Disbursed  Existing_EMI  Loan_Amount_Applied  Loan_Tenure_Applied  Monthly_Income          Var4          Var5           Age  EMI_Loan_Submitted_Missing  Interest_Rate_Missing
count  20000.000000   20000.00000         2.000000e+04         20000.000000    2.000000e+04  20000.000000  20000.000000  20000.000000                20000.000000           20000.000000
mean       0.016000    3890.29622         2.424191e+05             2.254450    4.575767e+04      2.966350      5.851350     30.878950                    0.644000               0.644000
std        0.125478   10534.21647         3.582973e+05             1.988467    4.575422e+05      1.575989      5.835997      6.829651                    0.478827               0.478827
min        0.000000       0.00000         0.000000e+00             0.000000    1.000000e+01      1.000000      0.000000     18.000000                    0.000000               0.000000
25%        0.000000       0.00000         0.000000e+00             0.000000    1.700000e+04      1.000000      0.000000     26.000000                    0.000000               0.000000
50%        0.000000       0.00000         1.000000e+05             2.000000    2.500000e+04      3.000000      3.000000     29.000000                    1.000000               1.000000
75%        0.000000    4000.00000         3.000000e+05             4.000000    4.000000e+04      5.000000     11.000000     34.000000                    1.000000               1.000000
max        1.000000  420000.00000         9.000000e+06            10.000000    5.495454e+07      7.000000     17.000000     65.000000                    1.000000               1.000000

             Var2_2        Var2_3        Var2_4        Var2_5        Var2_6  Mobile_Verified_0  Mobile_Verified_1      Source_0      Source_1      Source_2
count  20000.000000  20000.000000  20000.000000  20000.000000  20000.000000       20000.000000       20000.000000  20000.000000  20000.000000  20000.000000
mean       0.239300      0.009300      0.064950      0.026450      0.000900           0.354100           0.645900      0.115500      0.645450      0.239050
std        0.426667      0.095989      0.246444      0.160473      0.029987           0.478252           0.478252      0.319632      0.478389      0.426514
min        0.000000      0.000000      0.000000      0.000000      0.000000           0.000000           0.000000      0.000000      0.000000      0.000000
25%        0.000000      0.000000      0.000000      0.000000      0.000000           0.000000           0.000000      0.000000      0.000000      0.000000
50%        0.000000      0.000000      0.000000      0.000000      0.000000           0.000000           1.000000      0.000000      1.000000      0.000000
75%        0.000000      0.000000      0.000000      0.000000      0.000000           1.000000           1.000000      0.000000      1.000000      0.000000
max        1.000000      1.000000      1.000000      1.000000      1.000000           1.000000           1.000000      1.000000      1.000000      1.000000

8 rows × 50 columns (first and last ten columns shown; the 30 middle columns are elided by pandas)

target = "Disbursed"
IDcol = "ID"
train["Disbursed"].value_counts()
0    19680
1      320
Name: Disbursed, dtype: int64
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train["Disbursed"]

Classification with a CART decision tree

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, dtc.predict(x_test)))
Accuracy (train): 0.9997
Accuracy (test): 0.9682
Classification report:
              precision    recall  f1-score   support

          0       0.99      0.98      0.98      4931
          1       0.04      0.06      0.05        69

avg / total       0.97      0.97      0.97      5000

Classification with the GBDT ensemble

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
gbm0 = GradientBoostingClassifier(random_state=10)
gbm0.fit(x_train, y_train)
y_pred = gbm0.predict(x_test)                   # hard class labels
y_predprob = gbm0.predict_proba(x_test)[:,1]    # probability of class 1, used for AUC below
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm0.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm0.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, gbm0.predict(x_test)))
Accuracy (train): 0.9844
Accuracy (test): 0.9854
Classification report:
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4931
          1       0.00      0.00      0.00        69

avg / total       0.97      0.99      0.98      5000
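With only 69 positives among 5000 test samples, accuracy is dominated by the majority class: note the 0.00 recall for class 1 above. AUC is a more informative metric here, and it is what the grid searches below optimize; a minimal computation from the probabilities we already stored:

# AUC ranks the predicted probabilities, so it stays meaningful even when
# the default 0.5 threshold never predicts the minority class
print("AUC (test): %f" % metrics.roc_auc_score(y_test, y_predprob))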

Tuning GBDT's n_estimators parameter

from sklearn.model_selection import GridSearchCV
params = {"n_estimators": range(20, 81, 10)}
gsearch1 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, 
                                                             min_samples_split=300,  # a node is only split further if it holds at least this many samples
                                                             min_samples_leaf=20,    # a leaf must keep at least this many samples, or it is pruned together with its sibling
                                                             max_depth=8,            # maximum depth of each tree
                                                             max_features="sqrt",    # features considered per split; "sqrt" (or "auto") means at most sqrt(N) features
                                                             subsample=0.8,          # subsampling without replacement; 0.8 means each tree sees 80% of the data, guarding against overfitting
                                                             random_state=10),
                        param_grid=params,
                        scoring="roc_auc",
                        iid=False,
                        cv=5)
gsearch1.fit(x_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)

([mean: 0.82240, std: 0.03087, params: {'n_estimators': 20},
  mean: 0.82469, std: 0.03055, params: {'n_estimators': 30},
  mean: 0.82479, std: 0.03178, params: {'n_estimators': 40},
  mean: 0.82445, std: 0.02968, params: {'n_estimators': 50},
  mean: 0.82230, std: 0.02993, params: {'n_estimators': 60},
  mean: 0.82074, std: 0.02881, params: {'n_estimators': 70},
  mean: 0.81918, std: 0.02904, params: {'n_estimators': 80}],
 {'n_estimators': 40},
 0.8247923927911079)

Tuning GBDT's max_depth and min_samples_split parameters

params2 = {"max_depth": range(3, 14, 2), "min_samples_split": range(100, 801, 200)}
gsearch2 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, 
                                                             n_estimators=60, 
                                                             min_samples_leaf=20, 
                                                             max_features="sqrt", 
                                                             subsample=0.8, 
                                                             random_state=10),
                        param_grid=params2, 
                        scoring="roc_auc", 
                        iid=False, 
                        cv=5)
gsearch2.fit(x_train, y_train)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)

([mean: 0.81718, std: 0.03177, params: {'max_depth': 3, 'min_samples_split': 100},
  mean: 0.81821, std: 0.02824, params: {'max_depth': 3, 'min_samples_split': 300},
  mean: 0.81938, std: 0.02993, params: {'max_depth': 3, 'min_samples_split': 500},
  mean: 0.81850, std: 0.02894, params: {'max_depth': 3, 'min_samples_split': 700},
  mean: 0.82919, std: 0.02452, params: {'max_depth': 5, 'min_samples_split': 100},
  mean: 0.82704, std: 0.02582, params: {'max_depth': 5, 'min_samples_split': 300},
  mean: 0.82595, std: 0.02603, params: {'max_depth': 5, 'min_samples_split': 500},
  mean: 0.82930, std: 0.02581, params: {'max_depth': 5, 'min_samples_split': 700},
  mean: 0.82742, std: 0.02200, params: {'max_depth': 7, 'min_samples_split': 100},
  mean: 0.81882, std: 0.02066, params: {'max_depth': 7, 'min_samples_split': 300},
  mean: 0.82529, std: 0.02404, params: {'max_depth': 7, 'min_samples_split': 500},
  mean: 0.82395, std: 0.02940, params: {'max_depth': 7, 'min_samples_split': 700},
  mean: 0.82908, std: 0.02157, params: {'max_depth': 9, 'min_samples_split': 100},
  mean: 0.81857, std: 0.03291, params: {'max_depth': 9, 'min_samples_split': 300},
  mean: 0.82545, std: 0.02825, params: {'max_depth': 9, 'min_samples_split': 500},
  mean: 0.82815, std: 0.02859, params: {'max_depth': 9, 'min_samples_split': 700},
  mean: 0.81604, std: 0.02591, params: {'max_depth': 11, 'min_samples_split': 100},
  mean: 0.82513, std: 0.02261, params: {'max_depth': 11, 'min_samples_split': 300},
  mean: 0.82908, std: 0.03235, params: {'max_depth': 11, 'min_samples_split': 500},
  mean: 0.82534, std: 0.02583, params: {'max_depth': 11, 'min_samples_split': 700},
  mean: 0.81899, std: 0.02132, params: {'max_depth': 13, 'min_samples_split': 100},
  mean: 0.82667, std: 0.02806, params: {'max_depth': 13, 'min_samples_split': 300},
  mean: 0.82685, std: 0.03581, params: {'max_depth': 13, 'min_samples_split': 500},
  mean: 0.82662, std: 0.02611, params: {'max_depth': 13, 'min_samples_split': 700}],
 {'max_depth': 5, 'min_samples_split': 700},
 0.8292976819017346)

Tuning GBDT's min_samples_leaf and min_samples_split parameters

param_test3 = {'min_samples_split':range(800,1900,200), 'min_samples_leaf':range(60,101,10)}
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, 
                                                               n_estimators=60,
                                                               max_depth=9,
                                                               max_features='sqrt',
                                                               subsample=0.8, 
                                                               random_state=10), 
                        param_grid = param_test3, 
                        scoring='roc_auc',
                        iid=False, 
                        cv=5)
gsearch3.fit(x_train, y_train)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)

([mean: 0.82938, std: 0.02746, params: {'min_samples_leaf': 60, 'min_samples_split': 800},
  mean: 0.82748, std: 0.03127, params: {'min_samples_leaf': 60, 'min_samples_split': 1000},
  mean: 0.82002, std: 0.03099, params: {'min_samples_leaf': 60, 'min_samples_split': 1200},
  mean: 0.82265, std: 0.03321, params: {'min_samples_leaf': 60, 'min_samples_split': 1400},
  mean: 0.82615, std: 0.02846, params: {'min_samples_leaf': 60, 'min_samples_split': 1600},
  mean: 0.82273, std: 0.02671, params: {'min_samples_leaf': 60, 'min_samples_split': 1800},
  mean: 0.82471, std: 0.03209, params: {'min_samples_leaf': 70, 'min_samples_split': 800},
  mean: 0.82705, std: 0.03119, params: {'min_samples_leaf': 70, 'min_samples_split': 1000},
  mean: 0.82525, std: 0.02723, params: {'min_samples_leaf': 70, 'min_samples_split': 1200},
  mean: 0.82698, std: 0.02734, params: {'min_samples_leaf': 70, 'min_samples_split': 1400},
  mean: 0.82374, std: 0.02662, params: {'min_samples_leaf': 70, 'min_samples_split': 1600},
  mean: 0.82543, std: 0.02728, params: {'min_samples_leaf': 70, 'min_samples_split': 1800},
  mean: 0.82468, std: 0.02681, params: {'min_samples_leaf': 80, 'min_samples_split': 800},
  mean: 0.82688, std: 0.02378, params: {'min_samples_leaf': 80, 'min_samples_split': 1000},
  mean: 0.82400, std: 0.02718, params: {'min_samples_leaf': 80, 'min_samples_split': 1200},
  mean: 0.82635, std: 0.03008, params: {'min_samples_leaf': 80, 'min_samples_split': 1400},
  mean: 0.82478, std: 0.02849, params: {'min_samples_leaf': 80, 'min_samples_split': 1600},
  mean: 0.82215, std: 0.02679, params: {'min_samples_leaf': 80, 'min_samples_split': 1800},
  mean: 0.82416, std: 0.02264, params: {'min_samples_leaf': 90, 'min_samples_split': 800},
  mean: 0.82559, std: 0.02115, params: {'min_samples_leaf': 90, 'min_samples_split': 1000},
  mean: 0.82556, std: 0.02317, params: {'min_samples_leaf': 90, 'min_samples_split': 1200},
  mean: 0.82452, std: 0.02702, params: {'min_samples_leaf': 90, 'min_samples_split': 1400},
  mean: 0.82319, std: 0.02409, params: {'min_samples_leaf': 90, 'min_samples_split': 1600},
  mean: 0.82400, std: 0.02738, params: {'min_samples_leaf': 90, 'min_samples_split': 1800},
  mean: 0.83031, std: 0.02758, params: {'min_samples_leaf': 100, 'min_samples_split': 800},
  mean: 0.82296, std: 0.02450, params: {'min_samples_leaf': 100, 'min_samples_split': 1000},
  mean: 0.82464, std: 0.02562, params: {'min_samples_leaf': 100, 'min_samples_split': 1200},
  mean: 0.82332, std: 0.02972, params: {'min_samples_leaf': 100, 'min_samples_split': 1400},
  mean: 0.82227, std: 0.02910, params: {'min_samples_leaf': 100, 'min_samples_split': 1600},
  mean: 0.82231, std: 0.02642, params: {'min_samples_leaf': 100, 'min_samples_split': 1800}],
 {'min_samples_leaf': 100, 'min_samples_split': 800},
 0.8303082748093461)

Classification with the best parameters

gbm = GradientBoostingClassifier(learning_rate=0.1,
                                 n_estimators=40,
                                 min_samples_split=800,
                                 min_samples_leaf=100,
                                 max_depth=5,
                                 max_features="sqrt",
                                 subsample=0.8,
                                 random_state=10)
gbm.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, gbm.predict(x_test)))
Accuracy (train): 0.9833
Accuracy (test): 0.9862
Classification report:
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4931
          1       0.00      0.00      0.00        69

avg / total       0.97      0.99      0.98      5000

D:\anaconda\setup\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
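The UndefinedMetricWarning at the end is expected rather than a bug: at the default 0.5 threshold the tuned model never predicts class 1, so precision for that class is undefined and reported as 0. A quick check via the confusion matrix, together with the AUC that the grid searches actually optimized:

# the second column of the confusion matrix is all zeros: no predicted positives
print(metrics.confusion_matrix(y_test, gbm.predict(x_test)))
print("AUC (test): %f" % metrics.roc_auc_score(y_test, gbm.predict_proba(x_test)[:, 1]))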