[Translated] Ensemble Learning with H2O (Part 1)

Date: 2019-11-16

H2O Ensemble: Stacking in H2O

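If you cannot get this version installed, don't worry about it; you can read the second translated article instead, though I recommend skimming this one first.
H2O Ensemble has been implemented as a standalone R package called h2oEnsemble. The package is an extension of the h2o package that lets the user train an ensemble on an H2O cluster using any of the H2O supervised learning algorithms. As with the h2o R package, all of the actual computation in h2oEnsemble is performed inside the H2O cluster rather than in R memory.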

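The main computational tasks in the Super Learner ensemble algorithm are the training and cross-validation of the base learners and the metalearner. Implementing the "plumbing" of the ensemble in R (rather than in Java) therefore does not incur a performance loss. All training and data processing are performed in the high-performance H2O cluster.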

H2O Ensemble currently supports only regression and binary classification; multiclass support will be added in a future release.

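Translator's note: the code below fails with the latest version of the h2o package, so I recommend installing an older version. The code to install the older version is shown below. It may be a little slow (the h2o package is about 50 MB), and the h2o package needs a Java environment to run.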

install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/9/R")))

Installing H2O Ensemble

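To install the h2oEnsemble package, you only need to follow the installation instructions in the README file, which are reproduced here for convenience.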

H2O R Package

First, you need to install the H2O R package if you have not already done so. For R installation instructions, see: http://h2o.ai/download

H2O Ensemble R Package

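The recommended way to install the h2oEnsemble R package is directly from GitHub using the devtools package. (H2O World tutorial attendees can install the package from the provided USB stick.)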

Install from GitHub

library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")

Higgs Demo

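This is a binary classification example that uses the h2o.ensemble function from the h2oEnsemble package. The demo uses a subset of the HIGGS dataset, which has 28 numeric features and a binary response. The machine learning task is to distinguish a signal process that produces Higgs bosons (Y = 1) from a background process that does not (Y = 0). The dataset contains roughly equal numbers of positive and negative examples; in other words, it is a class-balanced dataset.

If you are running from plain R, start R in the directory of this script. If you are running from RStudio, be sure to setwd() to the location of this script. h2o.init() starts H2O in R's current working directory. h2o.importFile() is the file import function in h2o.

Start the H2O cluster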

library(h2oEnsemble)  # This will load the `h2o` R package as well
h2o.init(nthreads = -1, enable_assertions = FALSE)  # Start an H2O cluster; nthreads = -1 means use all CPUs on the host
h2o.removeAll() # (Optional) Remove all objects in H2O cluster

Import the data

First, import the training and test sets.

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"

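For binary classification, the response should be encoded as a factor (called an enum in Java and categorical in Python's pandas). You can specify column types when calling h2o.importFile, or convert the column afterwards as follows: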

train[,y] <- as.factor(train[,y])  
test[,y] <- as.factor(test[,y])

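Specify the base learners and the metalearner

Here we use the default base learner library for h2o.ensemble, which contains the default H2O GLM, Random Forest, GBM and Deep Neural Net (all with default parameters). We also use the default metalearner, the H2O GLM.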

learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper", 
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"

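Train an ensemble

Train the ensemble using 5-fold cross-validation to generate the level-one data. Note that more folds take longer to train but may improve performance.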

fit <- h2o.ensemble(x = x, y = y, 
                    training_frame = train, 
                    family = family, 
                    learner = learner, 
                    metalearner = metalearner,
                    cvControl = list(V = 5))
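To make "level-one data" concrete, here is a minimal base-R sketch of the idea, written for illustration only. It does not use H2O and is not how h2oEnsemble is implemented internally; the simulated data, the two toy base learners (`m1`, `m2`) and all variable names are assumptions made for this example.

```r
# Minimal base-R sketch of stacking (illustration only, not h2oEnsemble internals).
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

# Two hypothetical base learners: each is a formula fit by logistic regression.
base_formulas <- list(m1 = y ~ x1, m2 = y ~ x2)

V <- 5
fold <- sample(rep(seq_len(V), length.out = n))  # assign every row to one of V folds

# Level-one data: one column of cross-validated predictions per base learner.
Z <- matrix(NA_real_, nrow = n, ncol = length(base_formulas),
            dimnames = list(NULL, names(base_formulas)))
for (v in seq_len(V)) {
  held_out <- fold == v
  for (j in seq_along(base_formulas)) {
    # Fit on the other V - 1 folds, predict on the held-out fold.
    fit_j <- glm(base_formulas[[j]], family = binomial(), data = dat[!held_out, ])
    Z[held_out, j] <- predict(fit_j, newdata = dat[held_out, ], type = "response")
  }
}

# Metalearner: a GLM trained on the level-one data.
level_one <- data.frame(Z, y = dat$y)
meta_fit <- glm(y ~ ., family = binomial(), data = level_one)
```

Each row's entry in `Z` comes from a model that never saw that row, which is why the metalearner can be trained on `Z` without simply rewarding the base learner that overfits the most.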

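Evaluate model performance

Since the response is binary, we can use the area under the ROC curve (AUC) to evaluate model performance. Compute test set performance; the base learners are sorted by AUC (the default metric for binomial classification):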

perf <- h2o.ensemble_performance(fit, newdata = test)

Print the performance of each base learner and of the ensemble:

> perf

Base learner performance, sorted by specified metric:
                   learner       AUC
1          h2o.glm.wrapper 0.6824304
4 h2o.deeplearning.wrapper 0.7006335
2 h2o.randomForest.wrapper 0.7570211
3          h2o.gbm.wrapper 0.7780807


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.781580655670451

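We can compare the performance of the ensemble with that of the individual learners in the ensemble.

The best single model is the GBM, with a test set AUC of 0.778, while the ensemble scores 0.7816. At first glance this gain may not seem like much, but in many industries, such as medicine or finance, such a small advantage can be very valuable.

To improve the performance of the ensemble, we have a few options:

Increase the number of cross-validation folds via the cvControl argument.
Change the base learners and the metalearner.

Note that the ensemble results above are not reproducible, because h2o.deeplearning is not reproducible when it runs on multiple cores and we did not set a seed for h2o.randomForest.wrapper.

If you would like to evaluate with a different metric, for example "MSE", you can pass it to the print function.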

> print(perf, metric = "MSE")

Base learner performance, sorted by specified metric:
                   learner       MSE
4 h2o.deeplearning.wrapper 0.2305775
1          h2o.glm.wrapper 0.2225176
2 h2o.randomForest.wrapper 0.2014339
3          h2o.gbm.wrapper 0.1916273


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (MSE): 0.1898735479034431

Predict

If you need to generate predictions (rather than only look at model performance), you can use the predict function on the test set.

pred <- predict(fit, newdata = test)

If you need to pull the predictions back into R memory for further processing, you can convert pred to a local R data frame as follows:

predictions <- as.data.frame(pred$pred)[,3]  #third column is P(Y==1)
labels <- as.data.frame(test[,y])[,1]

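The predict method for an h2o.ensemble fit returns a list of two objects. pred$pred contains the ensemble predictions, and pred$basepred is a matrix of the predictions from each base learner. In this example we used four base learners, so pred$basepred has four columns.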
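Once the predictions and labels are ordinary R vectors, metrics can be computed locally. As a hedged illustration, the sketch below computes AUC with the rank-based (Mann-Whitney) formula in base R. The helper `auc_rank` and the toy vectors are hypothetical names invented for this example; they stand in for `predictions` and `labels`, and the function is not part of h2o or h2oEnsemble.

```r
# Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative case.
auc_rank <- function(scores, labels) {
  labels <- as.integer(as.character(labels))  # accepts a 0/1 factor or a numeric vector
  r <- rank(scores)                           # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy vectors standing in for `predictions` and `labels`.
toy_predictions <- c(0.1, 0.4, 0.35, 0.8)
toy_labels <- factor(c(0, 0, 1, 1))
auc_rank(toy_predictions, toy_labels)  # 0.75
```

With the real objects, the call would be `auc_rank(predictions, labels)`, assuming the labels are coded as 0 and 1.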

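Specifying new learners

Now let's try a larger set of base learners. The h2oEnsemble package ships with four wrapper functions by default, and you can define custom wrappers that use non-default parameters.

Here is an example of how to create custom learner wrappers: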

h2o.glm.1 <- function(..., alpha = 0.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.2 <- function(..., alpha = 0.5) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.3 <- function(..., alpha = 1.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.randomForest.1 <- function(..., ntrees = 200, nbins = 50, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.randomForest.2 <- function(..., ntrees = 200, sample_rate = 0.75, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.3 <- function(..., ntrees = 200, sample_rate = 0.85, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.4 <- function(..., ntrees = 200, nbins = 50, balance_classes = TRUE, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, balance_classes = balance_classes, seed = seed)
h2o.gbm.1 <- function(..., ntrees = 100, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)
h2o.gbm.2 <- function(..., ntrees = 100, nbins = 50, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.gbm.3 <- function(..., ntrees = 100, max_depth = 10, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.gbm.4 <- function(..., ntrees = 100, col_sample_rate = 0.8, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.5 <- function(..., ntrees = 100, col_sample_rate = 0.7, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.6 <- function(..., ntrees = 100, col_sample_rate = 0.6, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.7 <- function(..., ntrees = 100, balance_classes = TRUE, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, balance_classes = balance_classes, seed = seed)
h2o.gbm.8 <- function(..., ntrees = 100, max_depth = 3, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.deeplearning.1 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.2 <- function(..., hidden = c(200,200,200), activation = "Tanh", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.3 <- function(..., hidden = c(500,500), activation = "RectifierWithDropout", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.4 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, balance_classes = TRUE, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, balance_classes = balance_classes, seed = seed)
h2o.deeplearning.5 <- function(..., hidden = c(100,100,100), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.6 <- function(..., hidden = c(50,50), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.7 <- function(..., hidden = c(100,100), activation = "Rectifier", epochs = 50, seed = 1)  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
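The wrappers above all follow one R pattern: declare a new default in the signature, then forward it by name to the underlying function together with `...`. The base-R sketch below isolates that pattern. `toy_learner` and its variants are hypothetical stand-ins invented for illustration, not h2oEnsemble functions.

```r
# Hypothetical stand-in for an algorithm wrapper; it only reports the settings it received.
toy_learner <- function(x, ntrees = 50, max_depth = 5) {
  list(x = x, ntrees = ntrees, max_depth = max_depth)
}

# Custom variant: a new default for ntrees, forwarded explicitly to the underlying function.
toy_learner.1 <- function(..., ntrees = 200) toy_learner(..., ntrees = ntrees)

# A default that is declared in the signature but NOT forwarded is silently ignored.
toy_learner.bad <- function(..., max_depth = 10) toy_learner(...)

toy_learner.1(x = "data")$ntrees       # 200
toy_learner.bad(x = "data")$max_depth  # 5, not 10
```

The second variant shows why every parameter named in a custom wrapper's signature also has to appear in the inner call: R raises no error when it is left out, and the underlying default is used instead.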

Let's select a subset of these learners for our base learner library and retrain the ensemble.

Customized base learner library

learner <- c("h2o.glm.wrapper",
             "h2o.randomForest.1", "h2o.randomForest.2",
             "h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8",
             "h2o.deeplearning.1", "h2o.deeplearning.6", "h2o.deeplearning.7")

Train with the new library of base learners:

fit <- h2o.ensemble(x = x, y = y, 
                    training_frame = train,
                    family = family, 
                    learner = learner, 
                    metalearner = metalearner,
                    cvControl = list(V = 5))

Evaluate test set performance:

perf <- h2o.ensemble_performance(fit, newdata = test)

The results are as follows:

> perf

Base learner performance, sorted by specified metric:
             learner       AUC
1    h2o.glm.wrapper 0.6824304
7 h2o.deeplearning.1 0.6897187
8 h2o.deeplearning.6 0.6998472
9 h2o.deeplearning.7 0.7048874
2 h2o.randomForest.1 0.7668024
3 h2o.randomForest.2 0.7697849
4          h2o.gbm.1 0.7751240
6          h2o.gbm.8 0.7752852
5          h2o.gbm.6 0.7771115


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.780924502576107

So what happens if we remove some of the weaker learners? Let's remove the GLM and DL learners from the library and see what happens.

learner <- c("h2o.randomForest.1", "h2o.randomForest.2",
             "h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8")

Retrain the ensemble once more and evaluate performance:

fit <- h2o.ensemble(x = x, y = y, 
                     training_frame = train,
                     family = family, 
                     learner = learner, 
                     metalearner = metalearner,
                     cvControl = list(V = 5))

perf <- h2o.ensemble_performance(fit, newdata = test)

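We actually lose ensemble performance by removing the weak learners! This demonstrates the power of stacking with a large and diverse set of base learners.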

> perf

Base learner performance, sorted by specified metric:
             learner       AUC
1 h2o.randomForest.1 0.7668024
2 h2o.randomForest.2 0.7697849
3          h2o.gbm.1 0.7751240
5          h2o.gbm.8 0.7752852
4          h2o.gbm.6 0.7771115


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.778853964308554

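At first thought, you might assume that removing the less performant models would increase the performance of the ensemble. However, each learner makes its own unique contribution to the ensemble, and the added diversity among learners usually improves performance. The stacking algorithm learns an optimized way of combining all the learners, one that is superior to other combination methods.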
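To put the three runs side by side, the short base-R snippet below tabulates the ensemble AUC and the best single-learner AUC reported in the outputs above and computes the gain for each library. The numbers are copied from this article; the data frame and its column names are only an illustrative layout.

```r
# Test set AUC values copied from the three printed `perf` outputs above.
results <- data.frame(
  learner_set   = c("4 default learners", "9 custom learners", "5 RF/GBM learners"),
  best_base_auc = c(0.7780807, 0.7771115, 0.7771115),
  ensemble_auc  = c(0.781580655670451, 0.780924502576107, 0.778853964308554)
)

# Gain of the ensemble over the best single base learner in each library.
results$gain <- results$ensemble_auc - results$best_base_auc
results[order(-results$ensemble_auc), ]
```

The library that keeps only the random forests and GBMs shows the smallest gain of the three, which matches the point made above about diversity.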

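Stacking an existing set of models

[Figure: schematic diagram of stacking]

You can also use existing H2O models as a starting point and use the h2o.stack() function to stack them together with a specified metalearner.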

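The base models must have been trained on the same dataset with the same response, and they must all have been cross-validated using the same folds (the same number of folds and the same assignment of rows to folds).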
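The requirement for identical folds is easy to see in a small base-R sketch. It assumes that a modulo-style assignment places a row in fold (zero-based row index) mod nfolds; that rule is my reading of the `fold_assignment = "Modulo"` option used in the code below and should be checked against the H2O documentation. The helper `modulo_folds` is a hypothetical name used only here.

```r
nfolds <- 5
n_rows <- 12

# Assumed rule: fold id = (zero-based row index) modulo nfolds.
modulo_folds <- function(n, nfolds) (seq_len(n) - 1) %% nfolds

# Two models trained separately still get exactly the same folds.
folds_model_a <- modulo_folds(n_rows, nfolds)
folds_model_b <- modulo_folds(n_rows, nfolds)
identical(folds_model_a, folds_model_b)  # TRUE

# A random assignment, by contrast, depends on the seed each model happens to use.
set.seed(1); random_a <- sample(rep(0:(nfolds - 1), length.out = n_rows))
set.seed(2); random_b <- sample(rep(0:(nfolds - 1), length.out = n_rows))
table(random_a == random_b)  # compare the two random assignments row by row
```

Because a deterministic rule depends only on the row order and the number of folds, the cross-validated predictions of different models line up row by row, which is what the metalearner needs.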

An example follows. As above, start the H2O cluster and load the training and test data.

library(h2oEnsemble)
h2o.init(nthreads = -1) # Start H2O cluster using all available CPU threads


# Import a sample binary outcome train/test set into R
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"

#For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])

Cross-validate and train a handful of base learners, then use the h2o.stack() function to create the ensemble:

# The h2o.stack function is an alternative to the h2o.ensemble function, which
# allows the user to specify H2O models individually and then stack them together
# at a later time.  Saved models, re-loaded from disk, can also be stacked.

# The base models must use identical cv folds; this can be achieved in two ways:
# 1. they be specified explicitly by using the fold_column argument, or
# 2. use same value for `nfolds` and set `fold_assignment = "Modulo"`

nfolds <- 5  

glm1 <- h2o.glm(x = x, y = y, family = family, 
                training_frame = train,
                nfolds = nfolds,
                fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE)

gbm1 <- h2o.gbm(x = x, y = y, distribution = "bernoulli",
                training_frame = train,
                seed = 1,
                nfolds = nfolds,
                fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE)

rf1 <- h2o.randomForest(x = x, y = y, # distribution not used for RF
                        training_frame = train,
                        seed = 1,
                        nfolds = nfolds,
                        fold_assignment = "Modulo",
                        keep_cross_validation_predictions = TRUE)

dl1 <- h2o.deeplearning(x = x, y = y, distribution = "bernoulli",
                        training_frame = train,
                        nfolds = nfolds,
                        fold_assignment = "Modulo",
                        keep_cross_validation_predictions = TRUE)

models <- list(glm1, gbm1, rf1, dl1)
metalearner <- "h2o.glm.wrapper"

stack <- h2o.stack(models = models,
                   response_frame = train[,y],
                   metalearner = metalearner, 
                   seed = 1,
                   keep_levelone_data = TRUE)


# Compute test set performance:
perf <- h2o.ensemble_performance(stack, newdata = test)

Print the test set performance of the base learners and of the ensemble:

> print(perf)

Base learner performance, sorted by specified metric:
                                   learner       AUC
1          GLM_model_R_1480128759162_16643 0.6822933
4 DeepLearning_model_R_1480128759162_18909 0.7016809
3          DRF_model_R_1480128759162_17790 0.7546005
2          GBM_model_R_1480128759162_16661 0.7780807


H2O Ensemble Performance on <newdata>:
----------------
Family: binomial

Ensemble performance (AUC): 0.781241759877087

Roadmap for H2O Ensemble

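H2O Ensemble is currently available only through the R API; however, it will be accessible through all of our APIs in a future release.

Update: ensembles have now been implemented as a model class in the H2O Java core, "H2OStackedEnsembleEstimator".

The code can be found on the ensembles branch of h2o-3. APIs in R and Python are coming soon.

See the h2o.stackedEnsemble function in h2o.

Shut down H2O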

 h2o.shutdown()

The H2O slides that accompany this tutorial are here.

The GitHub page for ensembles is here.

Original article: https://github.com/h2oai/h2o-...