GBDT is widely used in practice. Unlike AdaBoost, which reweights the training samples according to the error rate of the previous round's weak learner, GBDT uses forward stagewise additive modeling, with the weak learner restricted to a CART regression tree.
An intuitive picture of GBDT: suppose a house is priced at 1,000,000. We first fit it with 800,000, leaving a residual of 200,000; we then fit that residual with 150,000, leaving 50,000; next we fit with 40,000, leaving 10,000; and so on, iterating until the residual drops below a preset threshold.
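The house-price story above can be sketched in a few lines, under the assumption of squared-error loss on a toy 1-D regression problem (all variable names here are illustrative, not from any library API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy illustration of boosting by residual fitting: each round trains a
# shallow tree on "how much is still unexplained" and adds a damped
# version of its prediction to the running total.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel()

pred = np.zeros_like(y)                  # start from a zero prediction
for _ in range(50):
    residual = y - pred                  # current shortfall, like the 200,000 above
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += 0.5 * tree.predict(X)        # shrinkage factor plays the role of a learning rate

mse_before = np.mean(y ** 2)             # error of the initial zero prediction
mse_after = np.mean((y - pred) ** 2)     # error after 50 residual-fitting rounds
```

Each iteration shrinks the residual, exactly as each smaller house-price guess shrinks the remaining loss.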
The remaining question is how to fit the loss at each round. Friedman's solution is to use the negative gradient of the loss function as an approximation to the current round's loss, and to fit a CART regression tree to that negative gradient.
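Concretely, at round $t$ the fitting target for sample $i$ is the negative gradient of the loss, evaluated at the previous model:

```latex
r_{ti} = -\left[ \frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)} \right]_{f(x) = f_{t-1}(x)}
```

For squared-error loss $L(y, f) = \tfrac{1}{2}(y - f)^2$ this reduces to $r_{ti} = y_i - f_{t-1}(x_i)$, i.e. the plain residual, which matches the house-price intuition above; other losses (absolute error, Huber, log loss) just change what "residual" means.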
GBDT has many advantages; in summary:
1. Because CART trees are used as the weak learner, it can handle many kinds of data, both continuous and categorical features.
2. Its predictive accuracy is high.
3. It can use robust loss functions, which makes it quite robust to outliers.
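Point 3 can be demonstrated with scikit-learn's `GradientBoostingRegressor`, which supports `loss="huber"` (a robust loss that switches from squared to absolute error beyond a quantile of the residuals). A minimal sketch on synthetic data with injected outliers (the data and thresholds are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = X.ravel() + rng.normal(0, 0.1, size=300)  # true relation: y ≈ x
y[::30] += 50.0                               # inject 10 gross outliers

# Huber loss caps the influence of large residuals, so the fit should
# still track the underlying line despite the outliers.
huber = GradientBoostingRegressor(loss="huber", random_state=0)
huber.fit(X, y)

grid = np.linspace(0, 10, 50).reshape(-1, 1)
err = np.mean(np.abs(huber.predict(grid) - grid.ravel()))
```

With squared error, those ten points would pull the fit upward much more strongly; the Huber fit stays close to `y = x`.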
Below is a small demo of GBDT in practice.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
%matplotlib inline

data = load_iris()
X = data.data
y = data.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

fig = plt.figure()
plt.scatter(x_train[y_train==0][:, 1], x_train[y_train==0][:, 2])
plt.scatter(x_train[y_train==1][:, 1], x_train[y_train==1][:, 2])
plt.scatter(x_train[y_train==2][:, 1], x_train[y_train==2][:, 2])
plt.legend(data.target_names)
plt.show()
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
import pydotplus
from sklearn import metrics

dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, dtc.predict(x_test), target_names=data.target_names))

dot_data = export_graphviz(decision_tree=dtc, out_file=None,
                           feature_names=data.feature_names,
                           class_names=data.target_names,
                           filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Accuracy (train): 1
Accuracy (test): 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa        1.00      1.00      1.00        11
 versicolor        0.80      1.00      0.89         8
  virginica        1.00      0.89      0.94        19

avg / total        0.96      0.95      0.95        38
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
print("Accuracy(train) : %.4g" % gbc.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gbc.score(x_test, y_test))
print("Classification report:\n", metrics.classification_report(y_test, gbc.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa        1.00      1.00      1.00        11
 versicolor        0.80      1.00      0.89         8
  virginica        1.00      0.89      0.94        19

avg / total        0.96      0.95      0.95        38
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

params = {"max_depth": list(range(1, 11))}
gbc = GradientBoostingClassifier()
gs = GridSearchCV(gbc, param_grid=params, cv=10)
gs.fit(x_train, y_train)
print("Accuracy(train) : %.4g" % gs.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gs.score(x_test, y_test))
print("Classification report:\n", metrics.classification_report(y_test, gs.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa        1.00      1.00      1.00        11
 versicolor        0.80      1.00      0.89         8
  virginica        1.00      0.89      0.94        19

avg / total        0.96      0.95      0.95        38
# Note: grid_scores_ was deprecated in sklearn 0.18 and removed in 0.20; use cv_results_ in newer versions
gs.best_estimator_, gs.best_score_, gs.best_params_, gs.grid_scores_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20

(GradientBoostingClassifier(criterion='friedman_mse', init=None, learning_rate=0.1,
     loss='deviance', max_depth=4, max_features=None, max_leaf_nodes=None,
     min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1,
     min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100,
     presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False),
 0.9642857142857143,
 {'max_depth': 4},
 [mean: 0.94643, std: 0.06325, params: {'max_depth': 1},
  mean: 0.94643, std: 0.07307, params: {'max_depth': 2},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 3},
  mean: 0.96429, std: 0.04335, params: {'max_depth': 4},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 5},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 6},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 7},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 8},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 9},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 10}])
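As the deprecation warning says, `grid_scores_` is gone from sklearn 0.20 onward; its replacement is `cv_results_`, a dict of arrays that is easiest to read as a DataFrame. A minimal sketch (a fresh, self-contained search on iris rather than the `gs` object above, with an illustrative parameter grid):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
gs = GridSearchCV(GradientBoostingClassifier(random_state=0),
                  param_grid={"max_depth": [1, 2, 3]}, cv=3)
gs.fit(X, y)

# cv_results_ contains per-candidate means, stds, ranks, and fit times;
# select the columns that grid_scores_ used to show.
results = pd.DataFrame(gs.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
```

`results` then plays the role of the old `grid_scores_` listing, one row per candidate.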
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()
ada.fit(x_train, y_train)
print("Accuracy(train) : %.4g" % ada.score(x_train, y_train))
print("Accuracy(test) : %.4g" % ada.score(x_test, y_test))
print("Classification report:\n", metrics.classification_report(y_test, ada.predict(x_test), target_names=data.target_names))
Accuracy(train) : 0.9821
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa        1.00      1.00      1.00        11
 versicolor        0.80      1.00      0.89         8
  virginica        1.00      0.89      0.94        19

avg / total        0.96      0.95      0.95        38
import xgboost

params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,                # number of classes, used together with multi:softmax
    'gamma': 0.1,                  # minimum loss reduction for a split; larger is more conservative (typically 0.1-0.2)
    'max_depth': 3,                # tree depth; deeper trees overfit more easily
    'lambda': 1,                   # L2 regularization on leaf weights; larger values make overfitting less likely
    'subsample': 0.8,              # row subsampling of the training set
    'colsample_bytree': 0.7,       # column subsampling when growing each tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 suppresses training logs; set to 0 to see them
    'eta': 0.01,                   # shrinkage, analogous to the learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}

dtrain = xgboost.DMatrix(x_train, y_train)
num_rounds = 500
model = xgboost.train(params=params, dtrain=dtrain, num_boost_round=num_rounds)

dtest = xgboost.DMatrix(x_test)
ans = model.predict(dtest)
cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1
print("Accuracy(test): \n", cnt1 / (cnt1 + cnt2))
Accuracy(test): 0.9473684210526315
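The manual counting loop above can be replaced by `metrics.accuracy_score`, which computes the same fraction of exact matches. A sketch with stand-in arrays (in the notebook, `ans` and `y_test` come from `model.predict(dtest)` and `train_test_split` respectively):

```python
import numpy as np
from sklearn import metrics

# Illustrative stand-ins for the xgboost predictions and true labels.
ans = np.array([0, 1, 2, 2, 1, 0])
y_test = np.array([0, 1, 2, 1, 1, 0])

# accuracy_score = (number of positions where ans == y_test) / len(y_test)
acc = metrics.accuracy_score(y_test, ans)   # here 5 of 6 match
```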
import pandas as pd

train = pd.read_csv("./train_modified.csv")
# train.head(10)
train.describe()
# train[train.isnull().values==True]
Disbursed | Existing_EMI | Loan_Amount_Applied | Loan_Tenure_Applied | Monthly_Income | Var4 | Var5 | Age | EMI_Loan_Submitted_Missing | Interest_Rate_Missing | ... | Var2_2 | Var2_3 | Var2_4 | Var2_5 | Var2_6 | Mobile_Verified_0 | Mobile_Verified_1 | Source_0 | Source_1 | Source_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 20000.000000 | 20000.00000 | 2.000000e+04 | 20000.000000 | 2.000000e+04 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | ... | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 |
mean | 0.016000 | 3890.29622 | 2.424191e+05 | 2.254450 | 4.575767e+04 | 2.966350 | 5.851350 | 30.878950 | 0.644000 | 0.644000 | ... | 0.239300 | 0.009300 | 0.064950 | 0.026450 | 0.000900 | 0.354100 | 0.645900 | 0.115500 | 0.645450 | 0.239050 |
std | 0.125478 | 10534.21647 | 3.582973e+05 | 1.988467 | 4.575422e+05 | 1.575989 | 5.835997 | 6.829651 | 0.478827 | 0.478827 | ... | 0.426667 | 0.095989 | 0.246444 | 0.160473 | 0.029987 | 0.478252 | 0.478252 | 0.319632 | 0.478389 | 0.426514 |
min | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 1.000000e+01 | 1.000000 | 0.000000 | 18.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.00000 | 0.000000e+00 | 0.000000 | 1.700000e+04 | 1.000000 | 0.000000 | 26.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.00000 | 1.000000e+05 | 2.000000 | 2.500000e+04 | 3.000000 | 3.000000 | 29.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
75% | 0.000000 | 4000.00000 | 3.000000e+05 | 4.000000 | 4.000000e+04 | 5.000000 | 11.000000 | 34.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
max | 1.000000 | 420000.00000 | 9.000000e+06 | 10.000000 | 5.495454e+07 | 7.000000 | 17.000000 | 65.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 50 columns
target = "Disbursed"
IDcol = "ID"
train["Disbursed"].value_counts()
0    19680
1      320
Name: Disbursed, dtype: int64
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train["Disbursed"]
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, dtc.predict(x_test)))
Accuracy (train): 0.9997
Accuracy (test): 0.9682
Classification report:
              precision    recall  f1-score   support

          0       0.99      0.98      0.98      4931
          1       0.04      0.06      0.05        69

avg / total       0.97      0.97      0.97      5000
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics

gbm0 = GradientBoostingClassifier(random_state=10)
gbm0.fit(x_train, y_train)
y_pred = gbm0.predict(x_test)
y_predprob = gbm0.predict_proba(x_test)[:, 1]
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm0.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm0.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, gbm0.predict(x_test)))
Accuracy (train): 0.9844
Accuracy (test): 0.9854
Classification report:
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4931
          1       0.00      0.00      0.00        69

avg / total       0.97      0.99      0.98      5000
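With only 69 positives among 5000 test samples, accuracy is dominated by class 0 (predicting all zeros already scores 98.6%), which is why the report shows 0.00 for class 1. AUC on the predicted probabilities is a more informative metric here, and it is exactly what the grid searches below use as `scoring="roc_auc"`. Since `train_modified.csv` is not available in this snippet, a self-contained sketch on a synthetic imbalanced set:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# ~2% positive class, mimicking the Disbursed imbalance (illustrative data).
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=10)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=10)

gbm = GradientBoostingClassifier(random_state=10).fit(x_tr, y_tr)
proba = gbm.predict_proba(x_te)[:, 1]        # probability of the rare class
auc = metrics.roc_auc_score(y_te, proba)     # threshold-free ranking metric
```

Unlike accuracy, AUC measures how well the rare positives are ranked above the negatives, so it cannot be gamed by always predicting the majority class.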
from sklearn.model_selection import GridSearchCV

params = {"n_estimators": range(20, 81, 10)}
gsearch1 = GridSearchCV(
    estimator=GradientBoostingClassifier(
        learning_rate=0.1,
        min_samples_split=300,  # a node with fewer samples than this is not split further
        min_samples_leaf=20,    # minimum samples per leaf; smaller leaves are pruned together with their siblings
        max_depth=8,            # maximum depth of each tree
        max_features="sqrt",    # consider at most sqrt(N) features per split ("sqrt" / "auto")
        subsample=0.8,          # subsample 80% of rows without replacement, to reduce overfitting
        random_state=10),
    param_grid=params, scoring="roc_auc", iid=False, cv=5)
gsearch1.fit(x_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20

([mean: 0.82240, std: 0.03087, params: {'n_estimators': 20},
  mean: 0.82469, std: 0.03055, params: {'n_estimators': 30},
  mean: 0.82479, std: 0.03178, params: {'n_estimators': 40},
  mean: 0.82445, std: 0.02968, params: {'n_estimators': 50},
  mean: 0.82230, std: 0.02993, params: {'n_estimators': 60},
  mean: 0.82074, std: 0.02881, params: {'n_estimators': 70},
  mean: 0.81918, std: 0.02904, params: {'n_estimators': 80}],
 {'n_estimators': 40},
 0.8247923927911079)
params2 = {"max_depth": range(3, 14, 2), "min_samples_split": range(100, 801, 200)}
gsearch2 = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,
                                         min_samples_leaf=20, max_features="sqrt",
                                         subsample=0.8, random_state=10),
    param_grid=params2, scoring="roc_auc", iid=False, cv=5)
gsearch2.fit(x_train, y_train)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20

([mean: 0.81718, std: 0.03177, params: {'max_depth': 3, 'min_samples_split': 100},
  mean: 0.81821, std: 0.02824, params: {'max_depth': 3, 'min_samples_split': 300},
  mean: 0.81938, std: 0.02993, params: {'max_depth': 3, 'min_samples_split': 500},
  mean: 0.81850, std: 0.02894, params: {'max_depth': 3, 'min_samples_split': 700},
  mean: 0.82919, std: 0.02452, params: {'max_depth': 5, 'min_samples_split': 100},
  mean: 0.82704, std: 0.02582, params: {'max_depth': 5, 'min_samples_split': 300},
  mean: 0.82595, std: 0.02603, params: {'max_depth': 5, 'min_samples_split': 500},
  mean: 0.82930, std: 0.02581, params: {'max_depth': 5, 'min_samples_split': 700},
  mean: 0.82742, std: 0.02200, params: {'max_depth': 7, 'min_samples_split': 100},
  mean: 0.81882, std: 0.02066, params: {'max_depth': 7, 'min_samples_split': 300},
  mean: 0.82529, std: 0.02404, params: {'max_depth': 7, 'min_samples_split': 500},
  mean: 0.82395, std: 0.02940, params: {'max_depth': 7, 'min_samples_split': 700},
  mean: 0.82908, std: 0.02157, params: {'max_depth': 9, 'min_samples_split': 100},
  mean: 0.81857, std: 0.03291, params: {'max_depth': 9, 'min_samples_split': 300},
  mean: 0.82545, std: 0.02825, params: {'max_depth': 9, 'min_samples_split': 500},
  mean: 0.82815, std: 0.02859, params: {'max_depth': 9, 'min_samples_split': 700},
  mean: 0.81604, std: 0.02591, params: {'max_depth': 11, 'min_samples_split': 100},
  mean: 0.82513, std: 0.02261, params: {'max_depth': 11, 'min_samples_split': 300},
  mean: 0.82908, std: 0.03235, params: {'max_depth': 11, 'min_samples_split': 500},
  mean: 0.82534, std: 0.02583, params: {'max_depth': 11, 'min_samples_split': 700},
  mean: 0.81899, std: 0.02132, params: {'max_depth': 13, 'min_samples_split': 100},
  mean: 0.82667, std: 0.02806, params: {'max_depth': 13, 'min_samples_split': 300},
  mean: 0.82685, std: 0.03581, params: {'max_depth': 13, 'min_samples_split': 500},
  mean: 0.82662, std: 0.02611, params: {'max_depth': 13, 'min_samples_split': 700}],
 {'max_depth': 5, 'min_samples_split': 700},
 0.8292976819017346)
param_test3 = {"min_samples_split": range(800, 1900, 200),
               "min_samples_leaf": range(60, 101, 10)}
gsearch3 = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,
                                         max_depth=9, max_features="sqrt",
                                         subsample=0.8, random_state=10),
    param_grid=param_test3, scoring="roc_auc", iid=False, cv=5)
gsearch3.fit(x_train, y_train)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20

([mean: 0.82938, std: 0.02746, params: {'min_samples_leaf': 60, 'min_samples_split': 800},
  mean: 0.82748, std: 0.03127, params: {'min_samples_leaf': 60, 'min_samples_split': 1000},
  mean: 0.82002, std: 0.03099, params: {'min_samples_leaf': 60, 'min_samples_split': 1200},
  mean: 0.82265, std: 0.03321, params: {'min_samples_leaf': 60, 'min_samples_split': 1400},
  mean: 0.82615, std: 0.02846, params: {'min_samples_leaf': 60, 'min_samples_split': 1600},
  mean: 0.82273, std: 0.02671, params: {'min_samples_leaf': 60, 'min_samples_split': 1800},
  mean: 0.82471, std: 0.03209, params: {'min_samples_leaf': 70, 'min_samples_split': 800},
  mean: 0.82705, std: 0.03119, params: {'min_samples_leaf': 70, 'min_samples_split': 1000},
  mean: 0.82525, std: 0.02723, params: {'min_samples_leaf': 70, 'min_samples_split': 1200},
  mean: 0.82698, std: 0.02734, params: {'min_samples_leaf': 70, 'min_samples_split': 1400},
  mean: 0.82374, std: 0.02662, params: {'min_samples_leaf': 70, 'min_samples_split': 1600},
  mean: 0.82543, std: 0.02728, params: {'min_samples_leaf': 70, 'min_samples_split': 1800},
  mean: 0.82468, std: 0.02681, params: {'min_samples_leaf': 80, 'min_samples_split': 800},
  mean: 0.82688, std: 0.02378, params: {'min_samples_leaf': 80, 'min_samples_split': 1000},
  mean: 0.82400, std: 0.02718, params: {'min_samples_leaf': 80, 'min_samples_split': 1200},
  mean: 0.82635, std: 0.03008, params: {'min_samples_leaf': 80, 'min_samples_split': 1400},
  mean: 0.82478, std: 0.02849, params: {'min_samples_leaf': 80, 'min_samples_split': 1600},
  mean: 0.82215, std: 0.02679, params: {'min_samples_leaf': 80, 'min_samples_split': 1800},
  mean: 0.82416, std: 0.02264, params: {'min_samples_leaf': 90, 'min_samples_split': 800},
  mean: 0.82559, std: 0.02115, params: {'min_samples_leaf': 90, 'min_samples_split': 1000},
  mean: 0.82556, std: 0.02317, params: {'min_samples_leaf': 90, 'min_samples_split': 1200},
  mean: 0.82452, std: 0.02702, params: {'min_samples_leaf': 90, 'min_samples_split': 1400},
  mean: 0.82319, std: 0.02409, params: {'min_samples_leaf': 90, 'min_samples_split': 1600},
  mean: 0.82400, std: 0.02738, params: {'min_samples_leaf': 90, 'min_samples_split': 1800},
  mean: 0.83031, std: 0.02758, params: {'min_samples_leaf': 100, 'min_samples_split': 800},
  mean: 0.82296, std: 0.02450, params: {'min_samples_leaf': 100, 'min_samples_split': 1000},
  mean: 0.82464, std: 0.02562, params: {'min_samples_leaf': 100, 'min_samples_split': 1200},
  mean: 0.82332, std: 0.02972, params: {'min_samples_leaf': 100, 'min_samples_split': 1400},
  mean: 0.82227, std: 0.02910, params: {'min_samples_leaf': 100, 'min_samples_split': 1600},
  mean: 0.82231, std: 0.02642, params: {'min_samples_leaf': 100, 'min_samples_split': 1800}],
 {'min_samples_leaf': 100, 'min_samples_split': 800},
 0.8303082748093461)
gbm = GradientBoostingClassifier(learning_rate=0.1, n_estimators=40, min_samples_split=800,
                                 min_samples_leaf=100, max_depth=5, max_features="sqrt",
                                 subsample=0.8, random_state=10)
gbm.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm.predict(x_test)))
print("Classification report:\n", metrics.classification_report(y_test, gbm.predict(x_test)))
Accuracy (train): 0.9833
Accuracy (test): 0.9862
Classification report:
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4931
          1       0.00      0.00      0.00        69

avg / total       0.97      0.99      0.98      5000

D:\anaconda\setup\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.