
XGBoost is a fast, scalable implementation of gradient boosting that has taken data science by storm, with XGBoost models regularly winning online data science competitions and being used at scale across different industries. In this course, you'll learn how to use this powerful library alongside pandas and scikit-learn to build and tune supervised learning models. You'll work with real-world datasets to solve classification as well as regression problems.

Classification example

This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark.

In X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]:

  1. [:, :-1] takes every column up to, but not including, the last one, because the stop index -1 is exclusive.
  2. [:, -1] is the last column, which is $y$.

random_state=123 plays the same role as set.seed(123) in R. train_test_split here comes from sklearn.model_selection: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123) shows that the model itself also needs its random seed set.

In [333]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
In [334]:
churn_data = pd.read_csv('data/ride-sharing.csv')
In [335]:
# import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective="binary:logistic", n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
accuracy: 0.743300

Decision tree example

In [337]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
In [338]:
# Import the necessary modules
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X, y = breast_cancer.data, breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth = 4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)
accuracy: 0.9649122807017544

max_depth is the maximum depth of the tree; see sklearn.tree.DecisionTreeClassifier — scikit-learn 0.19.1 documentation

Cross-validation: xgb.cv

Boosting combines a set of weak learners into a strong learner.

Cross-validation generates many non-overlapping train/test splits on the training data and reports the average test-set performance across all splits.

Its main purpose here is to keep the boosted model from overfitting.

We can split the whole dataset into two parts, one for training and one for validation: the familiar training set and test set. This simple approach has two drawbacks.

  1. The final model and parameter choices depend heavily on how you happen to split the data into training and test sets.
  2. After the split, far fewer samples are left for training the model.

Compared with the single-split test set approach, LOOCV has clear advantages. It does not depend on how the data are split, because every observation serves as the test set exactly once, and each model is trained on n-1 observations, i.e. almost all of the data, which keeps the bias small. Its obvious drawback is computational cost: it requires fitting n models, roughly n times the work of a single train/test split. A minimal sketch comparing the two schemes follows the reference below.

K-fold Cross Validation is easy to understand once you notice that LOOCV is just the special case K = N.

Bias-Variance Trade-Off for k-fold Cross-Validation: the choice of K is a trade-off between bias and variance.

The larger K is, the more training data each fold uses, so the model's bias is smaller. But a larger K also means the training sets of different folds overlap more (consider the extreme case K = N, i.e. LOOCV, where the training data are almost identical every time), and this strong correlation makes the final test-error estimate have higher variance.

Reference: 机器学习 Cross-Validation(交叉验证)详解
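
A minimal sketch of the two schemes using scikit-learn's splitters on a small synthetic dataset (the estimator and data here are purely illustrative, not from the notebook):

from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Small synthetic dataset just to illustrate the two splitting schemes
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=123)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: 5 fits, each trained on 80% of the data
kfold_scores = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=5, shuffle=True, random_state=123))

# LOOCV = K-fold with K = n: 100 fits, each trained on n-1 observations (much slower)
loo_scores = cross_val_score(model, X_demo, y_demo, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())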

Note that DMatrix and XGBClassifier both come from the xgboost package (imported as xgb).

In xgb.cv,

  1. num_boost_round is the number of boosting iterations.
  2. metrics="error" reports the classification error, so accuracy is $1 - \text{error}$.
In [340]:
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(np.mean(1-cv_results["test-error-mean"]))
   train-error-mean  train-error-std  test-error-mean  test-error-std
0          0.025480         0.002451         0.066824        0.019564
1          0.021969         0.001257         0.061524        0.013876
2          0.014945         0.006589         0.056252        0.010004
3          0.012306         0.003300         0.052734        0.011418
4          0.010549         0.004314         0.054497        0.012485
0.9416338

XGBoost is short for Extreme Gradient Boosting.

cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. Converting 'test-error-mean' into an accuracy (accuracy = 1 - error) and averaging over the five rounds gives roughly 94% here.

This shows how a quick xgb.cv run gives a much more reliable picture of test-set accuracy than a single train/test split.

AUC

In [342]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print(np.mean(cv_results["test-auc-mean"]))
   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.987225       0.001301       0.961473      0.024760
1        0.993244       0.004295       0.969078      0.022616
2        0.995224       0.003751       0.972491      0.024377
3        0.997125       0.002042       0.971354      0.025405
4        0.997610       0.001871       0.974002      0.026527
0.9696795999999999

Note that metrics accepts either a single string, metrics="auc", or a list of strings, metrics=["auc"]; either way the metric names are passed as text. Passing a list lets one cv run track several metrics at once, as in the sketch below.

  • num_boost_round (int): number of boosting iterations.
  • nfold (int): number of folds in CV.
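
A minimal sketch of the multi-metric case, reusing churn_dmatrix and params defined above:

cv_multi = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5,
                  metrics=["auc", "error"], as_pandas=True, seed=123)
print(cv_multi.columns)  # now contains both test-auc-mean and test-error-mean columns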

In each subsequent round, the samples that the previous round failed on are given larger weights (where $D_i$ gives each sample's probability of being drawn for training in round $i$), so they are more likely to appear in that round's training and the later learners concentrate on the hard-to-learn samples.

Bagging and Boosting

An ensemble is simply a collection of predictors that are combined (for example by averaging all of their predictions) to make a final prediction. We use ensembles because many different predictors trying to predict the same target variable tend to do better than any single predictor alone. Ensemble techniques are further divided into Bagging and Boosting; a short side-by-side sketch follows this list.

  1. Bagging is a simple ensemble technique in which we build many independent predictors/models/learners and combine them with some model-averaging technique (for example a weighted average, a majority vote, or a plain average).

We usually train each model on a random subsample/bootstrap of the data, so the models differ slightly from one another, and every observation has the same probability of appearing in every model. Because this technique combines many fairly uncorrelated learners into the final model, it reduces error by reducing variance. Random forests are an example of a bagging ensemble.

  2. Boosting is an ensemble technique in which the predictors are not independent but are built sequentially.

The logic is that each later predictor learns from the mistakes of the earlier ones. Observations therefore do not have equal probability of appearing in subsequent models: those with the largest errors appear most often. The predictors can be chosen from a range of models such as decision trees, regressors, classifiers, and so on. Because each new predictor learns from the errors of its predecessors, it takes fewer iterations to get close to the actual predictions, but the stopping criterion must be chosen carefully or the training data will be overfit. Gradient boosting is an example of a boosting algorithm.
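
A minimal side-by-side sketch, reusing the breast_cancer data loaded earlier (the estimator choices and settings here are illustrative, not from the original notebook):

# Bagging: many independent trees, averaged; Boosting: shallow trees built sequentially
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=4), n_estimators=50, random_state=123)
boosted = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=123)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    scores = cross_val_score(model, breast_cancer.data, breast_cancer.target, cv=5)
    print(name, scores.mean())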

Learning rate

$$\hat y \leftarrow \hat y - \alpha \cdot \frac{\partial \sum(\hat y - y)^2}{\partial \hat y}$$

where $\alpha$ is the learning rate, and

$$\frac{\partial \sum(\hat y - y)^2}{\partial \hat y} = 2 \sum(\hat y - y)$$

so each update adds a correction proportional to the residuals $y - \hat y$.

The intuition behind gradient boosting, then, is to repeatedly exploit the pattern in the residuals to strengthen a weak model until it predicts well. Once the residuals contain no pattern left to model, we stop modelling them (otherwise we would overfit). In short: don't chase the noise. A from-scratch sketch of this idea follows.
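
A from-scratch sketch of this residual-fitting loop on a small synthetic dataset (everything here is illustrative and relies only on numpy and sklearn, which are already available):

# Each boosting round fits a small tree to the current residuals, then adds a shrunken correction
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(123)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

alpha = 0.3                       # learning rate
pred = np.zeros_like(y_toy)       # start from a constant (zero) prediction
for _ in range(50):
    residual = y_toy - pred                    # proportional to the negative gradient of the squared error
    stump = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    pred += alpha * stump.predict(X_toy)       # shrink each correction by the learning rate

print("training MSE after 50 rounds:", np.mean((y_toy - pred) ** 2))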

When to use XGBoost

Suitable when

  1. $N \gg$ the number of columns, e.g. 1000 samples with 100 variables
  2. the data contain both continuous and categorical features, or only continuous features

Not suitable when

  1. the data are sparse, such as image or text data, which are usually better handled by deep learning
  2. the sample size is too small

Regression evaluation metrics

The evaluation metric is not the same thing as the objective (loss).

$$RMSE = \left(\frac{\sum (y-\hat y)^2}{n}\right)^{\frac{1}{2}}$$

$$MAE = \frac{\sum|y-\hat y|}{n}$$
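
A quick numeric check of the two definitions (the values are made up for illustration):

# RMSE and MAE computed directly from their formulas
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
mae = np.mean(np.abs(y_true - y_hat))
print(rmse, mae)  # RMSE penalizes large errors more heavily than MAE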

Objective and base learner

... use trees as base learners. By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with booster="gbtree".

Linear Base Learner:

  • Sum of linear terms
  • Boosted model is weighted sum of linear models (thus is itself linear)
  • Rarely used

Tree Base Learner:

  • Decision tree
  • Boosted model is weighted sum of decision trees (nonlinear)
  • Almost exclusively used in XGBoost

booster selects the base learner: one option is tree ("gbtree"), the other is linear ("gblinear"). The latter is rarely used, because a weighted sum of linear models is still linear and cannot capture non-linear effects.

The objective is what we usually call the loss function:

  1. reg:linear - use for regression problems (now deprecated in favor of reg:squarederror, as the warnings in the output below show)
  2. reg:logistic - use for classification problems when you want just the decision, not the probability
  3. binary:logistic - use when you want the probability rather than just the decision

We want base learners that, when combined, create a final prediction that is non-linear.

Once the base learners are combined, the ensemble can fit non-linear relationships.

Each base learner should be good at distinguishing or predicting different parts of the dataset

Each base learner only needs to learn one part of the data well.

Choosing the base learner

In [345]:
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
In [346]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
xg_reg = xgb.XGBRegressor(objective="reg:linear", n_estimators=10, seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
[13:35:10] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
RMSE: 9.749041
In [348]:
xg_reg.booster
Out[348]:
'gbtree'
In [349]:
DM_train = xgb.DMatrix(data=X_train,label=y_train)
DM_test = xgb.DMatrix(data=X_test,label=y_test)
params = {"booster":"gblinear","objective":"reg:linear"}
xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=10)
preds = xg_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
[13:35:18] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
RMSE: 6.061921
In [350]:
xg_reg.booster
Out[350]:
'gblinear'

So here the linear base learner actually does better. For regression problems, it is worth trying booster="gblinear".

Cross-validation

One run uses RMSE as the metric, the other MAE.

In [351]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='rmse', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0        17.120438        0.057830       17.151866       0.295723
1        12.353698        0.034427       12.510376       0.372386
2         9.017977        0.038795        9.245965       0.314345
3         6.690101        0.047236        7.060159       0.317659
4         5.069411        0.048644        5.571861       0.252100
4    5.571861
Name: test-rmse-mean, dtype: float64
In [352]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='mae', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
   train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
0       15.584812       0.087903      15.567934      0.345122
1       11.036514       0.069404      11.044831      0.347553
2        7.827224       0.052691       7.886081      0.315104
3        5.596108       0.044331       5.718952      0.288004
4        4.062843       0.052193       4.285985      0.175467
4    4.285985
Name: test-mae-mean, dtype: float64

Regularization

Reference: DataCamp

Regularization parameters in XGBoost:

  • gamma - minimum loss reduction required for a split to occur
  • alpha - l1 regularization on leaf weights, larger values mean more regularization
  • lambda - l2 regularization on leaf weights
  1. gamma effectively limits the number of leaves, since splits whose loss reduction is below gamma are not made; a short gamma sketch follows this list.
  2. alpha and lambda both act on the base learner's weights; they differ only in the penalty (l1 vs l2).
    1. For the gbtree booster, the penalty applies to the leaf weights in the loss function;
    2. for the gblinear booster, it applies to the feature weights, much like alpha in sklearn's Lasso/Ridge.
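
gamma itself is not swept in the cells that follow; here is a minimal sketch in the same style, reusing housing_dmatrix from above (the candidate values are arbitrary):

gamma_params = [0, 1, 10]
rmses_gamma = []
for g in gamma_params:
    # Larger gamma prunes more splits (a split must reduce the loss by at least gamma)
    cv_res = xgb.cv(dtrain=housing_dmatrix,
                    params={"objective": "reg:squarederror", "max_depth": 4, "gamma": g},
                    nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    rmses_gamma.append(cv_res["test-rmse-mean"].tail(1).values[0])
print(pd.DataFrame(list(zip(gamma_params, rmses_gamma)), columns=["gamma", "rmse"]))
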
In [353]:
X, y = boston.data, boston.target
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
l1_params = [1, 10, 100]
# Create the initial parameter dictionary for varying l1 strength: params
params = {"objective":"reg:linear","max_depth":4} # by default, gbtree
# Create an empty list for storing rmses as a function of l1 complexity
rmses_l1 = []
# Iterate over reg_params
for reg in l1_params:

    # Update l1 strength
    params["alpha"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", 
                             as_pandas=True, seed=123)
    
    # Append best rmse (final round) to rmses_l1
    rmses_l1.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [354]:
pd.DataFrame(list(zip(l1_params, rmses_l1)), columns=["l1", "rmse"])
Out[354]:
l1 rmse
0 1 5.924174
1 10 6.229010
2 100 7.139736

l2 rmse
0 1 7.139736
1 10 7.950858
2 100 10.937085
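
The l2 table above presumably comes from the analogous sweep over lambda; a minimal reconstruction sketch (using a fresh params dict so the alpha left over from the previous loop is not carried along, so exact numbers may differ):

l2_params = [1, 10, 100]
rmses_l2 = []
params_l2 = {"objective": "reg:squarederror", "max_depth": 4}
for reg in l2_params:
    params_l2["lambda"] = reg        # l2 regularization on weights
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params_l2, nfold=2,
                             num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])
print(pd.DataFrame(list(zip(l2_params, rmses_l2)), columns=["l2", "rmse"]))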

num_boosting_rounds rmse
0 50 3.335649
1 60 3.336287
2 70 3.335520
3 80 3.336278
4 90 3.336160
5 100 3.336507
6 110 3.336662
7 120 3.336759
8 130 3.336809
9 140 3.336778
10 150 3.336770
11 160 3.336770
12 170 3.336770
13 180 3.336770
14 190 3.336770

Based on this sweep, num_boost_round = 70 is chosen (a sketch reconstructing the sweep follows).
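
The num_boosting_rounds table likely comes from a sweep like the following sketch (assuming the plain reg:squarederror objective and xgb.cv's default nfold; the 70-round value matches the no-early-stopping baseline below):

round_list = list(range(50, 200, 10))
final_rmse_per_round = []
for n in round_list:
    cv_results = xgb.cv(dtrain=housing_dmatrix,
                        params={"objective": "reg:squarederror"},
                        num_boost_round=n, metrics="rmse",
                        as_pandas=True, seed=123)
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail(1).values[0])
print(pd.DataFrame(list(zip(round_list, final_rmse_per_round)),
                   columns=["num_boosting_rounds", "rmse"]))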

early stopping

Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (here 70).

early_stopping_rounds is evaluated on the hold-out (test) folds: training stops once the test RMSE has failed to improve for that many consecutive rounds.

In [366]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective": "reg:squarederror"}

# Create list of early_stopping_rounds values: 1 through 19, plus None (no early stopping)
early_stopping_round_list = list(range(1, 20))
early_stopping_round_list.append(None)

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over the early_stopping_rounds values and run one cross-validation per value
for curr_val in early_stopping_round_list:
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=curr_val,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

    # Pair each early_stopping_rounds value with its final-round RMSE
    early_stopping_round_rmses = list(zip(early_stopping_round_list, final_rmse_per_round))

pd.DataFrame(early_stopping_round_rmses, columns=["early_stopping_rounds", "rmse"])
Out[366]:
early_stopping_rounds rmse
0 1.0 3.358666
1 2.0 3.334396
2 3.0 3.334396
3 4.0 3.334396
4 5.0 3.334396
5 6.0 3.334396
6 7.0 3.334396
7 8.0 3.334396
8 9.0 3.334396
9 10.0 3.329619
10 11.0 3.329619
11 12.0 3.329619
12 13.0 3.329619
13 14.0 3.329619
14 15.0 3.329619
15 16.0 3.329619
16 17.0 3.329619
17 18.0 3.329619
18 19.0 3.329619
19 NaN 3.335520

learning rate

The learning rate ("eta") in XGBoost is a parameter that ranges between 0 and 1. It shrinks the contribution of each new boosting round, so smaller values make each step more conservative and act as stronger regularization, at the cost of requiring more boosting rounds.

$\eta \in [0,1]$ is the learning rate. It is a regularization knob rather than a "higher is better" setting: in the sweep below both extremes do worse, and eta = 0.2 gives the lowest RMSE.

In [367]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective": "reg:squarederror"}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1, 0.2, 0.3]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:

    params["eta"] = curr_val

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta", "best_rmse"])
Out[367]:
eta best_rmse
0 0.001 22.361745
1 0.010 12.648530
2 0.100 3.275962
3 0.200 3.231944
4 0.300 3.334396

max_depth

In [368]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror","eta": 0.2}

# Create list of max_depth values
max_depths = [2, 3,5, 6,10, 20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(list(zip(max_depths, best_rmse)), columns=["max_depth", "best_rmse"])
Out[368]:
max_depth best_rmse
0 2 3.314754
1 3 3.262898
2 5 3.237644
3 6 3.231944
4 10 3.287999
5 20 3.300656

colsample_bytree

Now, it's time to tune "colsample_bytree". If you've ever worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, you've seen a close relative called max_features. In both cases the parameter specifies a fraction of the features to sample, although max_features is applied at every split while colsample_bytree is applied once per tree. In xgboost, colsample_bytree must be specified as a float between 0 and 1.

In other words, it is the share of the columns that each boosting round's tree is allowed to use.

In [369]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror", "eta": 0.2, "max_depth": 6}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 0.9, 0.95, 1]
best_rmse = []

# Systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:

    params["colsample_bytree"] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(
    list(zip(colsample_bytree_vals, best_rmse)),
    columns=["colsample_bytree", "best_rmse"],
)
Out[369]:
colsample_bytree best_rmse
0 0.10 5.080896
1 0.50 3.309777
2 0.80 3.440138
3 0.90 3.459225
4 0.95 3.383537
5 1.00 3.231944

subsample

subsample dictates the fraction of the training data that is used during any given boosting round.

In [370]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {
    "objective": "reg:squarederror",
    "eta": 0.2,
    "max_depth": 6,
    "colsample_bytree": 1,
}

# Create list of hyperparameter values: colsample_bytree_vals
subsample_vals = [0.1, 0.5, 0.8, 0.9, 0.95, 1]
best_rmse = []

# Systematically vary the hyperparameter value
for curr_val in subsample_vals:

    params["subsample"] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(
    list(zip(subsample_vals, best_rmse)),
    columns=["subsample", "best_rmse"],
)
Out[370]:
subsample best_rmse
0 0.10 3.994764
1 0.50 3.306088
2 0.80 3.167907
3 0.90 3.274909
4 0.95 3.271472
5 1.00 3.231944

So the best single-hyperparameter settings found so far are

params = {
    "objective": "reg:squarederror",
    "eta": 0.2,
    "max_depth": 6,
    "colsample_bytree": 1,
    "subsample": 1
}
early_stopping_rounds=9,
num_boost_round=70,

The sweeps above found each hyperparameter's optimum with the other hyperparameters held fixed; now we consider several hyperparameters together and search around those values for a joint optimum.

In [371]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
In [372]:
X, y = boston.data, boston.target
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    "num_boost_round": [65,70,75],
    "early_stopping_rounds": [9, 10, 11],
    "eta": [0.15, 0.2, 0.25],
    "max_depth": [5,6,7],
    "colsample_bytree": [0.9,0.95,1],
    "subsample": [0.9,0.95,1],
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# Perform grid search: grid_mse
grid_mse = GridSearchCV(
    estimator=gbm,
    param_grid=gbm_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)

# Fit grid_mse to the data
grid_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
Fitting 3 folds for each of 729 candidates, totalling 2187 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'colsample_bytree': 1, 'early_stopping_rounds': 9, 'eta': 0.15, 'max_depth': 5, 'num_boost_round': 65, 'subsample': 1}
Lowest RMSE found:  5.064016178684625
[Parallel(n_jobs=1)]: Done 2187 out of 2187 | elapsed:  3.2min finished
D:\install\miniconda\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)

Here verbose=1 just prints progress messages; the run is slow not because of verbose but because the grid has 3^6 = 729 candidates × 3 folds = 2187 fits. (Also note that with the sklearn wrapper XGBRegressor, the number of trees is normally set via n_estimators; num_boost_round and early_stopping_rounds passed through the parameter grid are forwarded as booster kwargs and are probably ignored.)

In [373]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

params = {
    "objective": "reg:squarederror",
    "eta": 0.15,
    "max_depth": 5,
    "colsample_bytree": 1,
    "subsample": 1
}

best_rmse = []

# Systematically vary the hyperparameter value


# Perform cross-validation
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    early_stopping_rounds=9,
    num_boost_round=65,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)

# Append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(best_rmse)
[3.2324746666666666]
In [374]:
X, y = boston.data, boston.target
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    "num_boost_round": [65,70,75],
    "early_stopping_rounds": [9, 10, 11],
    "eta": [0.15, 0.2, 0.25],
    "max_depth": [5,6,7],
    "colsample_bytree": [0.9,0.95,1],
    "subsample": [0.9,0.95,1],
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# Perform grid search: grid_mse
randomized_mse = RandomizedSearchCV(
    estimator=gbm,
    param_distributions=gbm_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)

# Fit grid_mse to the data
randomized_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'subsample': 0.9, 'num_boost_round': 75, 'max_depth': 5, 'eta': 0.15, 'early_stopping_rounds': 9, 'colsample_bytree': 0.95}
Lowest RMSE found:  5.077023302795019
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.7s finished
D:\install\miniconda\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
In [375]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

params = {
    "objective": "reg:squarederror",
    "eta": 0.15,
    "max_depth": 5,
    "colsample_bytree": 0.95,
    "subsample": 0.9
}

best_rmse = []

# Systematically vary the hyperparameter value


# Perform cross-validation
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    early_stopping_rounds=9,
    num_boost_round=75,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)

# Append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(best_rmse)
[3.232053333333333]