
XGBoost is a fast, scalable implementation of gradient boosting that has taken data science by storm, with XGBoost models regularly winning online data science competitions and being used at scale across different industries. In this course, you'll learn how to use this powerful library alongside pandas and scikit-learn to build and tune supervised learning models. You'll work with real-world datasets to solve classification as well as regression problems.

Classification example

This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark.

In X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]:

  1. [:, :-1] takes every column up to, but not including, the last one, because the stop index -1 is exclusive.
  2. [:, -1] is the last column, which is $y$.

random_state=123 plays the same role as set.seed(123) in R. train_test_split here comes from sklearn.model_selection: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123) shows that the model itself also needs its random seed set.

In [333]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
In [334]:
churn_data = pd.read_csv('data/ride-sharing.csv')
In [335]:
# import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective="binary:logistic", n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
accuracy: 0.743300

Decision tree example

In [337]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
In [338]:
# Import the necessary modules
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X, y = breast_cancer.data, breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth = 4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)
accuracy: 0.9649122807017544

max_depth is the maximum depth of the tree; see sklearn.tree.DecisionTreeClassifier — scikit-learn 0.19.1 documentation

Cross-validation: xgb.cv

Boosting combines a set of weak learners into a strong learner.

Cross-validation generates many non-overlapping train/test splits on the training data and reports the average test-set performance across all splits.

Its main purpose here is to keep the boosted model from overfitting.

We can split the whole dataset into two parts, one for training and one for validation: the familiar training set and test set. This simple approach has two drawbacks.

  1. The final model and parameter choices depend heavily on how you happen to split the data into training and test sets.
  2. After the split, far fewer samples are left for training the model.

Compared with the single-split test set approach, LOOCV has clear advantages. It does not depend on how the data are split, because every observation serves as the test set exactly once, and each model is trained on n-1 observations, i.e. almost all of the data, which keeps the bias small. Its obvious drawback is computational cost: it requires fitting n models, roughly n times the work of a single train/test split. A minimal sketch comparing the two schemes follows the reference below.

K-fold Cross Validation is easy to understand once you notice that LOOCV is just the special case K = N.

Bias-Variance Trade-Off for k-fold Cross-Validation: the choice of K is a trade-off between bias and variance.

The larger K is, the more training data each fold uses, so the model's bias is smaller. But a larger K also means the training sets of different folds overlap more (consider the extreme case K = N, i.e. LOOCV, where the training data are almost identical every time), and this strong correlation makes the final test-error estimate have higher variance.

Reference: 机器学习 Cross-Validation(交叉验证)详解
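
A minimal sketch of the two schemes using scikit-learn's splitters on a small synthetic dataset (the estimator and data here are purely illustrative, not from the notebook):

from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Small synthetic dataset just to illustrate the two splitting schemes
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=123)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: 5 fits, each trained on 80% of the data
kfold_scores = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=5, shuffle=True, random_state=123))

# LOOCV = K-fold with K = n: 100 fits, each trained on n-1 observations (much slower)
loo_scores = cross_val_score(model, X_demo, y_demo, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())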

Note that DMatrix and XGBClassifier both come from the xgboost package (imported as xgb).

In xgb.cv,

  1. num_boost_round is the number of boosting iterations.
  2. metrics="error" reports the classification error, so accuracy is $1 - \text{error}$.
In [340]:
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(np.mean(1-cv_results["test-error-mean"]))
   train-error-mean  train-error-std  test-error-mean  test-error-std
0          0.025480         0.002451         0.066824        0.019564
1          0.021969         0.001257         0.061524        0.013876
2          0.014945         0.006589         0.056252        0.010004
3          0.012306         0.003300         0.052734        0.011418
4          0.010549         0.004314         0.054497        0.012485
0.9416338

XGBoost is short for Extreme Gradient Boosting.

cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. Converting 'test-error-mean' into an accuracy (accuracy = 1 - error) and averaging over the five rounds gives roughly 94% here.

This shows how a quick xgb.cv run gives a much more reliable picture of test-set accuracy than a single train/test split.

AUC

In [342]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print(np.mean(cv_results["test-auc-mean"]))
   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.987225       0.001301       0.961473      0.024760
1        0.993244       0.004295       0.969078      0.022616
2        0.995224       0.003751       0.972491      0.024377
3        0.997125       0.002042       0.971354      0.025405
4        0.997610       0.001871       0.974002      0.026527
0.9696795999999999

Note that metrics accepts either a single string, metrics="auc", or a list of strings, metrics=["auc"]; either way the metric names are passed as text. Passing a list lets one cv run track several metrics at once, as in the sketch below.

  • num_boost_round (int): number of boosting iterations.
  • nfold (int): number of folds in CV.
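
A minimal sketch of the multi-metric case, reusing churn_dmatrix and params defined above:

cv_multi = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5,
                  metrics=["auc", "error"], as_pandas=True, seed=123)
print(cv_multi.columns)  # now contains both test-auc-mean and test-error-mean columns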

In each subsequent round, the samples that the previous round failed on are given larger weights (where $D_i$ gives each sample's probability of being drawn for training in round $i$), so they are more likely to appear in that round's training and the later learners concentrate on the hard-to-learn samples.

Bagging and Boosting

An ensemble is simply a collection of predictors that are combined (for example by averaging all of their predictions) to make a final prediction. We use ensembles because many different predictors trying to predict the same target variable tend to do better than any single predictor alone. Ensemble techniques are further divided into Bagging and Boosting; a short side-by-side sketch follows this list.

  1. Bagging is a simple ensemble technique in which we build many independent predictors/models/learners and combine them with some model-averaging technique (for example a weighted average, a majority vote, or a plain average).

We usually train each model on a random subsample/bootstrap of the data, so the models differ slightly from one another, and every observation has the same probability of appearing in every model. Because this technique combines many fairly uncorrelated learners into the final model, it reduces error by reducing variance. Random forests are an example of a bagging ensemble.

  2. Boosting is an ensemble technique in which the predictors are not independent but are built sequentially.

The logic is that each later predictor learns from the mistakes of the earlier ones. Observations therefore do not have equal probability of appearing in subsequent models: those with the largest errors appear most often. The predictors can be chosen from a range of models such as decision trees, regressors, classifiers, and so on. Because each new predictor learns from the errors of its predecessors, it takes fewer iterations to get close to the actual predictions, but the stopping criterion must be chosen carefully or the training data will be overfit. Gradient boosting is an example of a boosting algorithm.
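
A minimal side-by-side sketch, reusing the breast_cancer data loaded earlier (the estimator choices and settings here are illustrative, not from the original notebook):

# Bagging: many independent trees, averaged; Boosting: shallow trees built sequentially
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=4), n_estimators=50, random_state=123)
boosted = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=123)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    scores = cross_val_score(model, breast_cancer.data, breast_cancer.target, cv=5)
    print(name, scores.mean())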

Learning rate

$$\hat y \leftarrow \hat y - \alpha \cdot \frac{\partial \sum(\hat y - y)^2}{\partial \hat y}$$

where $\alpha$ is the learning rate, and

$$\frac{\partial \sum(\hat y - y)^2}{\partial \hat y} = 2 \sum(\hat y - y)$$

so each update adds a correction proportional to the residuals $y - \hat y$.

The intuition behind gradient boosting, then, is to repeatedly exploit the pattern in the residuals to strengthen a weak model until it predicts well. Once the residuals contain no pattern left to model, we stop modelling them (otherwise we would overfit). In short: don't chase the noise. A from-scratch sketch of this idea follows.
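
A from-scratch sketch of this residual-fitting loop on a small synthetic dataset (everything here is illustrative and relies only on numpy and sklearn, which are already available):

# Each boosting round fits a small tree to the current residuals, then adds a shrunken correction
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(123)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

alpha = 0.3                       # learning rate
pred = np.zeros_like(y_toy)       # start from a constant (zero) prediction
for _ in range(50):
    residual = y_toy - pred                    # proportional to the negative gradient of the squared error
    stump = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    pred += alpha * stump.predict(X_toy)       # shrink each correction by the learning rate

print("training MSE after 50 rounds:", np.mean((y_toy - pred) ** 2))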

When to use XGBoost

Suitable when

  1. $N \gg$ the number of columns, e.g. 1000 samples with 100 variables
  2. the data contain both continuous and categorical features, or only continuous features

Not suitable when

  1. the data are sparse, such as image or text data, which are usually better handled by deep learning
  2. the sample size is too small

Regression evaluation metrics

The evaluation metric is not the same thing as the objective (loss).

$$RMSE = \left(\frac{\sum (y-\hat y)^2}{n}\right)^{\frac{1}{2}}$$

$$MAE = \frac{\sum|y-\hat y|}{n}$$
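
A quick numeric check of the two definitions (the values are made up for illustration):

# RMSE and MAE computed directly from their formulas
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
mae = np.mean(np.abs(y_true - y_hat))
print(rmse, mae)  # RMSE penalizes large errors more heavily than MAE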

Objective and base learner

... use trees as base learners. By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with booster="gbtree".

Linear Base Learner:

  • Sum of linear terms
  • Boosted model is weighted sum of linear models (thus is itself linear)
  • Rarely used

Tree Base Learner:

  • Decision tree
  • Boosted model is weighted sum of decision trees (nonlinear)
  • Almost exclusively used in XGBoost

booster selects the base learner: one option is tree ("gbtree"), the other is linear ("gblinear"). The latter is rarely used, because a weighted sum of linear models is still linear and cannot capture non-linear effects.

The objective is what we usually call the loss function:

  1. reg:linear - use for regression problems (now deprecated in favor of reg:squarederror, as the warnings in the output below show)
  2. reg:logistic - use for classification problems when you want just the decision, not the probability
  3. binary:logistic - use when you want the probability rather than just the decision

We want base learners that, when combined, create a final prediction that is non-linear.

Once the base learners are combined, the ensemble can fit non-linear relationships.

Each base learner should be good at distinguishing or predicting different parts of the dataset

Each base learner only needs to learn one part of the data well.

Choosing the base learner

In [345]:
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
In [346]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
xg_reg = xgb.XGBRegressor(objective="reg:linear", n_estimators=10, seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
[13:35:10] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
RMSE: 9.749041
In [348]:
xg_reg.booster
Out[348]:
'gbtree'
In [349]:
DM_train = xgb.DMatrix(data=X_train,label=y_train)
DM_test = xgb.DMatrix(data=X_test,label=y_test)
params = {"booster":"gblinear","objective":"reg:linear"}
xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=10)
preds = xg_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
[13:35:18] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
RMSE: 6.061921
In [350]:
xg_reg.booster
Out[350]:
'gblinear'

So here the linear base learner actually does better. For regression problems, it is worth trying booster="gblinear".

Cross-validation

One run uses RMSE as the metric, the other MAE.

In [351]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='rmse', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:36] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0        17.120438        0.057830       17.151866       0.295723
1        12.353698        0.034427       12.510376       0.372386
2         9.017977        0.038795        9.245965       0.314345
3         6.690101        0.047236        7.060159       0.317659
4         5.069411        0.048644        5.571861       0.252100
4    5.571861
Name: test-rmse-mean, dtype: float64
In [352]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='mae', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
   train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
0       15.584812       0.087903      15.567934      0.345122
1       11.036514       0.069404      11.044831      0.347553
2        7.827224       0.052691       7.886081      0.315104
3        5.596108       0.044331       5.718952      0.288004
4        4.062843       0.052193       4.285985      0.175467
4    4.285985
Name: test-mae-mean, dtype: float64

Regularization

Reference: DataCamp

Regularization parameters in XGBoost:

  • gamma - minimum loss reduction required for a split to occur
  • alpha - l1 regularization on leaf weights, larger values mean more regularization
  • lambda - l2 regularization on leaf weights
  1. gamma effectively limits the number of leaves, since splits whose loss reduction is below gamma are not made; a short gamma sketch follows this list.
  2. alpha and lambda both act on the base learner's weights; they differ only in the penalty (l1 vs l2).
    1. For the gbtree booster, the penalty applies to the leaf weights in the loss function;
    2. for the gblinear booster, it applies to the feature weights, much like alpha in sklearn's Lasso/Ridge.
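
gamma itself is not swept in the cells that follow; here is a minimal sketch in the same style, reusing housing_dmatrix from above (the candidate values are arbitrary):

gamma_params = [0, 1, 10]
rmses_gamma = []
for g in gamma_params:
    # Larger gamma prunes more splits (a split must reduce the loss by at least gamma)
    cv_res = xgb.cv(dtrain=housing_dmatrix,
                    params={"objective": "reg:squarederror", "max_depth": 4, "gamma": g},
                    nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    rmses_gamma.append(cv_res["test-rmse-mean"].tail(1).values[0])
print(pd.DataFrame(list(zip(gamma_params, rmses_gamma)), columns=["gamma", "rmse"]))
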
In [353]:
X, y = boston.data, boston.target
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
l1_params = [1, 10, 100]
# Create the initial parameter dictionary for varying l1 strength: params
params = {"objective":"reg:linear","max_depth":4} # by default, gbtree
# Create an empty list for storing rmses as a function of l1 complexity
rmses_l1 = []
# Iterate over reg_params
for reg in l1_params:

    # Update l1 strength
    params["alpha"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", 
                             as_pandas=True, seed=123)
    
    # Append best rmse (final round) to rmses_l1
    rmses_l1.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:35:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [354]:
pd.DataFrame(list(zip(l1_params, rmses_l1)), columns=["l1", "rmse"])
Out[354]:
l1 rmse
0 1 5.924174
1 10 6.229010
2 100 7.139736

l2 rmse
0 1 7.139736
1 10 7.950858
2 100 10.937085
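
The l2 table above presumably comes from the analogous sweep over lambda; a minimal reconstruction sketch (using a fresh params dict so the alpha left over from the previous loop is not carried along, so exact numbers may differ):

l2_params = [1, 10, 100]
rmses_l2 = []
params_l2 = {"objective": "reg:squarederror", "max_depth": 4}
for reg in l2_params:
    params_l2["lambda"] = reg        # l2 regularization on weights
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params_l2, nfold=2,
                             num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])
print(pd.DataFrame(list(zip(l2_params, rmses_l2)), columns=["l2", "rmse"]))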

num_boosting_rounds rmse
0 50 3.335649
1 60 3.336287
2 70 3.335520
3 80 3.336278
4 90 3.336160
5 100 3.336507
6 110 3.336662
7 120 3.336759
8 130 3.336809
9 140 3.336778
10 150 3.336770
11 160 3.336770
12 170 3.336770
13 180 3.336770
14 190 3.336770

Based on this sweep, num_boost_round = 70 is chosen (a sketch reconstructing the sweep follows).
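
The num_boosting_rounds table likely comes from a sweep like the following sketch (assuming the plain reg:squarederror objective and xgb.cv's default nfold; the 70-round value matches the no-early-stopping baseline below):

round_list = list(range(50, 200, 10))
final_rmse_per_round = []
for n in round_list:
    cv_results = xgb.cv(dtrain=housing_dmatrix,
                        params={"objective": "reg:squarederror"},
                        num_boost_round=n, metrics="rmse",
                        as_pandas=True, seed=123)
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail(1).values[0])
print(pd.DataFrame(list(zip(round_list, final_rmse_per_round)),
                   columns=["num_boosting_rounds", "rmse"]))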

early stopping

Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (here 70).

early_stopping_rounds is evaluated on the hold-out (test) folds: training stops once the test RMSE has failed to improve for that many consecutive rounds.

In [366]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective": "reg:squarederror"}

# Create list of early_stopping_rounds values: 1 through 19, plus None (no early stopping)
early_stopping_round_list = list(range(1, 20))
early_stopping_round_list.append(None)

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over the early_stopping_rounds values and run one cross-validation per value
for curr_val in early_stopping_round_list:
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=curr_val,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

    # Pair each early_stopping_rounds value with its final-round RMSE
    early_stopping_round_rmses = list(zip(early_stopping_round_list, final_rmse_per_round))

pd.DataFrame(early_stopping_round_rmses, columns=["early_stopping_rounds", "rmse"])
Out[366]:
early_stopping_rounds rmse
0 1.0 3.358666
1 2.0 3.334396
2 3.0 3.334396
3 4.0 3.334396
4 5.0 3.334396
5 6.0 3.334396
6 7.0 3.334396
7 8.0 3.334396
8 9.0 3.334396
9 10.0 3.329619
10 11.0 3.329619
11 12.0 3.329619
12 13.0 3.329619
13 14.0 3.329619
14 15.0 3.329619
15 16.0 3.329619
16 17.0 3.329619
17 18.0 3.329619
18 19.0 3.329619
19 NaN 3.335520

learning rate

The learning rate ("eta") in XGBoost is a parameter that ranges between 0 and 1. It shrinks the contribution of each new boosting round, so smaller values make each step more conservative and act as stronger regularization, at the cost of requiring more boosting rounds.

$\eta \in [0,1]$ is the learning rate. It is a regularization knob rather than a "higher is better" setting: in the sweep below both extremes do worse, and eta = 0.2 gives the lowest RMSE.

In [367]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective": "reg:squarederror"}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1, 0.2, 0.3]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:

    params["eta"] = curr_val

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta", "best_rmse"])
Out[367]:
eta best_rmse
0 0.001 22.361745
1 0.010 12.648530
2 0.100 3.275962
3 0.200 3.231944
4 0.300 3.334396

max_depth

In [368]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror","eta": 0.2}

# Create list of max_depth values
max_depths = [2, 3,5, 6,10, 20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(list(zip(max_depths, best_rmse)), columns=["max_depth", "best_rmse"])
Out[368]:
max_depth best_rmse
0 2 3.314754
1 3 3.262898
2 5 3.237644
3 6 3.231944
4 10 3.287999
5 20 3.300656

colsample_bytree

Now, it's time to tune "colsample_bytree". If you've ever worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, you've seen a close relative called max_features. In both cases the parameter specifies a fraction of the features to sample, although max_features is applied at every split while colsample_bytree is applied once per tree. In xgboost, colsample_bytree must be specified as a float between 0 and 1.

In other words, it is the share of the columns that each boosting round's tree is allowed to use.

In [369]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror", "eta": 0.2, "max_depth": 6}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 0.9, 0.95, 1]
best_rmse = []

# Systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:

    params["colsample_bytree"] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(
    list(zip(colsample_bytree_vals, best_rmse)),
    columns=["colsample_bytree", "best_rmse"],
)
Out[369]:
colsample_bytree best_rmse
0 0.10 5.080896
1 0.50 3.309777
2 0.80 3.440138
3 0.90 3.459225
4 0.95 3.383537
5 1.00 3.231944

subsample

subsample dictates the fraction of the training data that is used during any given boosting round.

In [370]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {
    "objective": "reg:squarederror",
    "eta": 0.2,
    "max_depth": 6,
    "colsample_bytree": 1,
}

# Create list of hyperparameter values: colsample_bytree_vals
subsample_vals = [0.1, 0.5, 0.8, 0.9, 0.95, 1]
best_rmse = []

# Systematically vary the hyperparameter value
for curr_val in subsample_vals:

    params["subsample"] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
pd.DataFrame(
    list(zip(subsample_vals, best_rmse)),
    columns=["subsample", "best_rmse"],
)
Out[370]:
subsample best_rmse
0 0.10 3.994764
1 0.50 3.306088
2 0.80 3.167907
3 0.90 3.274909
4 0.95 3.271472
5 1.00 3.231944

So the best single-hyperparameter settings found so far are

params = {
    "objective": "reg:squarederror",
    "eta": 0.2,
    "max_depth": 6,
    "colsample_bytree": 1,
    "subsample": 1
}
early_stopping_rounds=9,
num_boost_round=70,

The sweeps above found each hyperparameter's optimum with the other hyperparameters held fixed; now we consider several hyperparameters together and search around those values for a joint optimum.

In [371]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
In [372]:
X, y = boston.data, boston.target
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    "num_boost_round": [65,70,75],
    "early_stopping_rounds": [9, 10, 11],
    "eta": [0.15, 0.2, 0.25],
    "max_depth": [5,6,7],
    "colsample_bytree": [0.9,0.95,1],
    "subsample": [0.9,0.95,1],
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# Perform grid search: grid_mse
grid_mse = GridSearchCV(
    estimator=gbm,
    param_grid=gbm_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)

# Fit grid_mse to the data
grid_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
Fitting 3 folds for each of 729 candidates, totalling 2187 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'colsample_bytree': 1, 'early_stopping_rounds': 9, 'eta': 0.15, 'max_depth': 5, 'num_boost_round': 65, 'subsample': 1}
Lowest RMSE found:  5.064016178684625
[Parallel(n_jobs=1)]: Done 2187 out of 2187 | elapsed:  3.2min finished
D:\install\miniconda\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)

Here verbose=1 just prints progress messages; the run is slow not because of verbose but because the grid has 3^6 = 729 candidates × 3 folds = 2187 fits. (Also note that with the sklearn wrapper XGBRegressor, the number of trees is normally set via n_estimators; num_boost_round and early_stopping_rounds passed through the parameter grid are forwarded as booster kwargs and are probably ignored.)

In [373]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

params = {
    "objective": "reg:squarederror",
    "eta": 0.15,
    "max_depth": 5,
    "colsample_bytree": 1,
    "subsample": 1
}

best_rmse = []

# Systematically vary the hyperparameter value


# Perform cross-validation
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    early_stopping_rounds=9,
    num_boost_round=65,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)

# Append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(best_rmse)
[3.2324746666666666]
In [374]:
X, y = boston.data, boston.target
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    "num_boost_round": [65,70,75],
    "early_stopping_rounds": [9, 10, 11],
    "eta": [0.15, 0.2, 0.25],
    "max_depth": [5,6,7],
    "colsample_bytree": [0.9,0.95,1],
    "subsample": [0.9,0.95,1],
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# Perform grid search: grid_mse
randomized_mse = RandomizedSearchCV(
    estimator=gbm,
    param_distributions=gbm_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)

# Fit grid_mse to the data
randomized_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'subsample': 0.9, 'num_boost_round': 75, 'max_depth': 5, 'eta': 0.15, 'early_stopping_rounds': 9, 'colsample_bytree': 0.95}
Lowest RMSE found:  5.077023302795019
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.7s finished
D:\install\miniconda\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
In [375]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

params = {
    "objective": "reg:squarederror",
    "eta": 0.15,
    "max_depth": 5,
    "colsample_bytree": 0.95,
    "subsample": 0.9
}

best_rmse = []

# Systematically vary the hyperparameter value


# Perform cross-validation
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    early_stopping_rounds=9,
    num_boost_round=75,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)

# Append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(best_rmse)
[3.232053333333333]