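In this tutorial, you will learn how to build and optimize models with gradient boosting. This method dominates many Kaggle competitions and achieves state-of-the-art results on a variety of datasets.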

Introduction

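For much of this course, you have made predictions with the random forest method, which achieves better performance than a single decision tree by averaging the predictions of many decision trees.

We refer to the random forest method as an "ensemble method". By definition, ensemble methods combine the predictions of several models (e.g., several trees, in the case of random forests).

Next, we'll learn about another ensemble method called gradient boosting.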

Gradient Boosting

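Gradient boosting is a method that goes through cycles to iteratively add models into an ensemble.

It begins by initializing the ensemble with a single model, whose predictions can be quite naive. (Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.)

Then, we start the cycle:

First, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add the predictions from all models in the ensemble.
These predictions are used to calculate a loss function (like mean squared error, for instance).
Then, we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss. (Side note: The "gradient" in "gradient boosting" refers to the fact that we'll use gradient descent on the loss function to determine the parameters in this new model.)
Finally, we add the new model to the ensemble, and ...
... repeat!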
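
The cycle above can be sketched from scratch in a few lines. This is a minimal illustration, not how XGBoost is implemented: it assumes squared-error loss (where the negative gradient with respect to each prediction is proportional to the residual), a single numeric feature, and a one-split "stump" as the weak model. The helper names fit_stump and boost and the toy data are made up for this example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split model to the residuals (hypothetical weak learner)."""
    mean_residual = sum(residuals) / len(residuals)
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical, so no split is possible.
        return lambda x: mean_residual
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds):
    # Initialize the ensemble with a naive model: always predict the mean.
    base = sum(ys) / len(ys)
    ensemble = [lambda x: base]
    losses = []
    for _ in range(n_rounds):
        # 1. Predict with the current ensemble by summing all models.
        preds = [sum(model(x) for model in ensemble) for x in xs]
        # 2. Compute the loss.
        losses.append(mean_squared_error(ys, preds))
        # 3. Fit a new model to the residuals, which point in the direction
        #    of the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        new_model = fit_stump(xs, residuals)
        # 4. Add the new model to the ensemble, then repeat.
        ensemble.append(new_model)
    preds = [sum(model(x) for model in ensemble) for x in xs]
    losses.append(mean_squared_error(ys, preds))
    return ensemble, losses


# Toy data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5, 9.0, 9.5]
ensemble, losses = boost(xs, ys, n_rounds=10)
print("loss before boosting:", losses[0])
print("loss after 10 rounds:", losses[-1])
```

Each round can only lower (or leave unchanged) the training loss here, because the stump that best fits the residuals is never worse than adding nothing.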


Example

We begin by loading the training and validation data in X_train, X_valid, y_train, and y_valid.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

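In this example, you'll work with the XGBoost library. XGBoost stands for extreme gradient boosting, which is an implementation of gradient boosting with several additional features focused on performance and speed. (Scikit-learn has another version of gradient boosting, but XGBoost has some technical advantages.)

In the next code cell, we import the scikit-learn API for XGBoost (xgboost.XGBRegressor). This allows us to build and fit a model just as we would in scikit-learn. As you can see in the output, the XGBRegressor class has many tunable parameters; you'll learn about those soon!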

from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

We also make predictions and evaluate the model.

from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))
Mean Absolute Error: 241041.5160392121

Parameter Tuning

XGBoost has a few parameters that can dramatically affect accuracy and training speed. The first parameters you should understand are:

`n_estimators`

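n_estimators specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.

Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.
Too high a value causes overfitting, which leads to accurate predictions on training data but inaccurate predictions on test data (which is what we care about).

Typical values range from 100 to 1000, though this depends a lot on the learning_rate parameter discussed below.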

Here is the code to set the number of models in the ensemble:

my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=500, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

`early_stopping_rounds`

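early_stopping_rounds offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators. It's smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.

Since random chance sometimes causes a single round where the validation score doesn't improve, you need to specify how many consecutive rounds without improvement to allow before stopping. Setting early_stopping_rounds=5 is a reasonable choice. In this case, we stop once the validation score has gone 5 consecutive rounds without improving.

When using early_stopping_rounds, you also need to set aside some data for calculating the validation scores; this is done by setting the eval_set parameter.

We can modify the example above to include early stopping: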

my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)
/opt/conda/lib/python3.7/site-packages/xgboost/sklearn.py:797: UserWarning: `early_stopping_rounds` in `fit` method is deprecated for better compatibility with scikit-learn, use `early_stopping_rounds` in constructor or`set_params` instead.
  UserWarning,
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=500, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

If you later want to fit a model with all of your data, set n_estimators to whatever value you found to be optimal when run with early stopping.

`learning_rate`

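Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the learning rate) before adding them in.

This means each tree we add to the ensemble helps us less. So, we can set a higher value for n_estimators without overfitting. If we use early stopping, the appropriate number of trees will be determined automatically.

In general, a small learning rate and a large number of estimators will yield more accurate XGBoost models, though the model will also take longer to train since it does more iterations through the cycle. By default, XGBoost sets learning_rate=0.3, which is why the outputs above show learning_rate=0.300000012 for the models built without setting it.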
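
As a toy illustration of why a smaller learning rate calls for more rounds, the sketch below tracks how much error remains for a single observation. It assumes an idealized weak model that predicts the current residual exactly, which is a simplification and not how XGBoost fits trees. The helper residual_after and the starting residual of 100 are made up for this example.

```python
def residual_after(n_rounds, learning_rate, initial_residual=100.0):
    """Error left after n_rounds when each model's output is scaled down."""
    residual = initial_residual
    for _ in range(n_rounds):
        # Idealized weak model: it predicts the current residual exactly.
        correction = residual
        # Only a fraction of the correction is added to the ensemble.
        residual -= learning_rate * correction
    return residual


for learning_rate in (1.0, 0.3, 0.05):
    remaining = [round(residual_after(k, learning_rate), 2) for k in (1, 5, 20)]
    print("learning_rate =", learning_rate,
          "-> residual after 1, 5, 20 rounds:", remaining)
```

Under this assumption the residual shrinks by a factor of (1 - learning_rate) each round, so a rate of 0.05 needs many more rounds than 0.3 to remove the same amount of error. In real models the smaller steps also leave less room for any one tree to fit noise.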

Modifying the example above to change the learning rate yields the following code:

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=1000,
             n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=0, reg_lambda=1, ...)

`n_jobs`

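On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won't help.

The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction. But it's useful on large datasets where you would otherwise spend a long time waiting during the fit command.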

Here's the modified example:

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=1000,
             n_jobs=4, num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=0, reg_lambda=1, ...)

Conclusion

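XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.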