%load_ext blackcellmagic
XGBoost is a fast, scalable implementation of gradient boosting that has taken data science by storm: models built with XGBoost regularly win online data science competitions and are used at scale across many industries. In this course, you'll learn how to use this powerful library alongside pandas and scikit-learn to build and tune supervised learning models. You'll work with real-world datasets to solve classification as well as regression problems.
This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark.
In `X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]`, the slice `[:, :-1]` takes every column up to and including the second-to-last, because the stop index `-1` is exclusive, while `[:, -1]` takes the last column as $y$. `random_state=123` plays the same role as R's `set.seed(123)`, and `train_test_split` comes from `sklearn.model_selection`.
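As a quick sanity check of that slicing, here is a minimal sketch on a made-up three-column frame (the column names are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame: two features plus the target in the last column
df = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "churn": [0, 1]})

X = df.iloc[:, :-1]  # every column except the last -> features f1, f2
y = df.iloc[:, -1]   # the last column only -> target churn

print(X.columns.tolist())  # ['f1', 'f2']
print(y.tolist())          # [0, 1]
```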
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
So the random seed also has to be set inside the model itself.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
churn_data = pd.read_csv('data/ride-sharing.csv')
# import xgboost
import xgboost as xgb
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective="binary:logistic", n_estimators=10, seed=123)
# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)
# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)
# Compute the accuracy: accuracy
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
accuracy: 0.743300
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
# Import the necessary modules
from sklearn.tree import DecisionTreeClassifier
# Create the training and test sets
X, y = breast_cancer.data, breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth = 4)
# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)
# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)
# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)
accuracy: 0.9736842105263158
`max_depth` sets the maximum depth of the tree; see sklearn.tree.DecisionTreeClassifier — scikit-learn 0.19.1 documentation.
Boosting takes a pool of weak learners and turns them into a strong one.
Cross-validation: generates many non-overlapping train/test splits on the training data and reports the average test-set performance across all splits.
Its main purpose here is to guard against boosting overfitting.
We can split the whole dataset into two parts, one for training and one for validation; these are the familiar training set and test set. This simple approach, however, has two drawbacks.
Compared with the test set approach, LOOCV has clear advantages. It does not depend on how the data happen to be split, because every single observation takes its turn as the test set; and each fit uses $n-1$ observations, nearly all of the data, which keeps the model's bias small. Its equally obvious drawback is computational cost: it requires one fit per observation, i.e. $n$ fits instead of one.
K-fold cross-validation is easy to grasp; in fact, LOOCV is just the special case of K-fold cross-validation with $K = N$.
Bias-variance trade-off for k-fold cross-validation: the choice of $K$ trades bias against variance.
The larger $K$ is, the more data go into each training fold, so the model's bias shrinks. But a larger $K$ also means the training folds overlap more with one another (in the extreme case $K = N$, i.e. LOOCV, the training data are nearly identical from fold to fold), and that strong correlation gives the final test-error estimate a larger variance.
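A minimal sketch of the two schemes using scikit-learn's splitters (toy arrays, unrelated to the churn data):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(20).reshape(10, 2)  # 10 observations, 2 features

# K-fold: K non-overlapping test folds; here K = 5, so 5 model fits
kf = KFold(n_splits=5, shuffle=True, random_state=123)
print(sum(1 for _ in kf.split(X_demo)))   # 5

# LOOCV is the special case K = N: one fit per observation
loo = LeaveOneOut()
print(sum(1 for _ in loo.split(X_demo)))  # 10
```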
Reference: 机器学习 Cross-Validation(交叉验证)详解 (a Chinese primer on cross-validation).
Note that `DMatrix` and `XGBClassifier` both live inside the `xgboost` (`xgb`) package. In `xgb.cv`, `num_boost_round` sets the number of boosting iterations, and `metrics="error"` reports the classification error, from which accuracy follows as $1 - \text{error}$.
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="error", as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
# Print the accuracy
print(np.mean(1-cv_results["test-error-mean"]))
XGBoost is short for Extreme Gradient Boosting.
Nice work. `cv_results` stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From `cv_results`, the final round's `'test-error-mean'` is extracted and converted into an accuracy, where accuracy is `1 - error`. The final accuracy of around 75% is an improvement from earlier!
The point here is that a quick cross-validation already gives a more trustworthy picture of test-set accuracy.
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="auc", as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
# Print the AUC
print(np.mean(cv_results["test-auc-mean"]))
Note that `metrics="auc"` may equally be written as `metrics=["auc"]`; the argument is a string (or a list of strings). From the `xgb.cv` docs:
- `num_boost_round` (int): Number of boosting iterations.
- `nfold` (int): Number of folds in CV.
In each subsequent round, the samples that the previous round got wrong are given larger distribution weights ($D_i$ is the probability that each sample is drawn for training in round $i$), so they are more likely to appear in this round's training set; in other words, later rounds concentrate on the samples that are hard to learn.
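A hedged numpy sketch of that reweighting idea; this is the classic AdaBoost-style update, shown only to make $D_i$ concrete (it is not XGBoost's actual mechanism), and the labels and predictions are made up:

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1])       # true labels (toy)
preds = np.array([1, 1, -1, -1, -1])  # round-i predictions; sample 2 is wrong
D = np.full(len(y), 1 / len(y))       # D_i: uniform sampling weights at the start

eps = np.sum(D[preds != y])            # weighted error of this round (0.2)
alpha = 0.5 * np.log((1 - eps) / eps)  # weight given to this weak learner
D = D * np.exp(-alpha * y * preds)     # up-weight the misclassified sample
D = D / D.sum()                        # renormalize to a distribution
print(D)  # the misclassified sample now carries weight 0.5
```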
An ensemble is simply a collection of predictors that are pooled together (for example, by averaging all their predictions) to produce the final prediction. We use ensembles because many different predictors all trying to predict the same target variable will do a better job than any single predictor alone. Ensemble techniques split further into bagging and boosting.
For bagging, we usually draw a random subsample/bootstrap of the data for each model, so the models differ somewhat from one another; every observation has the same probability of appearing in every model. Because this technique combines many weakly correlated learners into the final model, it reduces error by reducing variance. Random forests are the classic example of a bagging ensemble.
Boosting instead uses the logic that each new predictor learns from the mistakes of the previous ones. Observations therefore do not have equal probability of appearing in subsequent models: the ones with the largest errors appear most often. The predictors can be chosen from a range of models, such as decision trees, regressors, or classifiers. Because each new predictor learns from its predecessors' mistakes, it takes fewer iterations to get close to the actual predictions; but the stopping criterion must be chosen carefully, or it can lead to overfitting the training data. Gradient boosting is an example of a boosting algorithm.
The update is $\hat y \leftarrow \hat y - \alpha \, \partial L / \partial \hat y$, where $\alpha$ is the learning rate. For squared loss,
$$\frac{\partial \sum_i (\hat y_i - y_i)^2}{\partial \hat y_i} = 2(\hat y_i - y_i),$$
so the gradient with respect to each prediction is just (twice) its residual.
So the intuition behind gradient boosting is to repeatedly exploit the patterns left in the residuals, strengthening a weak model until it becomes good. Once we reach a stage where the residuals show no pattern that can be modeled, we stop modeling them (continuing would risk overfitting). In short: don't chase the noise.
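A minimal sketch of that residual loop using plain decision stumps (illustrative only; real gradient boosting adds regularization and cleverer split finding):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

alpha = 0.3              # the learning rate from the update above
pred = np.zeros_like(y)  # start from the zero model
for _ in range(50):
    resid = y - pred                                 # residuals still unexplained
    stump = DecisionTreeRegressor(max_depth=2).fit(X, resid)
    pred += alpha * stump.predict(X)                 # small step toward the residuals

print(np.sqrt(np.mean((y - pred) ** 2)))  # training RMSE shrinks round by round
```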
Suitable for:
Not suitable for:
Note that the evaluation metric is not the objective; they are set separately.
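For example, in `xgb.cv` the objective (the loss being optimized) lives in `params`, while the reported evaluation metric is a separate argument. A sketch with random toy data:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 3)
y = np.random.rand(100)
dtrain = xgb.DMatrix(data=X, label=y)

# Objective: squared-error loss is what the trees are fit to minimize
params = {"objective": "reg:squarederror"}
# Metric: MAE is only what gets reported per boosting round
cv = xgb.cv(dtrain=dtrain, params=params, nfold=3,
            num_boost_round=5, metrics="mae", as_pandas=True, seed=123)
print(cv["test-mae-mean"].tail(1))
```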
$$RMSE = \left(\frac{\sum (y-\hat y)^2}{n}\right)^{\frac{1}{2}}$$

$$MAE = \frac{\sum|y-\hat y|}{n}$$

... use trees as base learners. By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with `booster="gbtree"`.
Linear Base Learner:
- Sum of linear terms
- Boosted model is weighted sum of linear models (thus is itself linear)
- Rarely used
Tree Base Learner:
- Decision tree
- Boosted model is weighted sum of decision trees (nonlinear)
- Almost exclusively used in XGBoost
`booster` selects the base learner: one option is tree, the other linear. The linear one is rarely used, because a weighted sum of linear models is still linear and cannot capture nonlinearity.
The objective is what we usually call the loss function.
Want base learners that, when combined, create a final prediction that is non-linear.
Once combined, the base learners can fit a nonlinear relationship.
Each base learner should be good at distinguishing or predicting different parts of the dataset
Each base learner only has to learn one part of the data well.
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
xg_reg = xgb.XGBRegressor(objective="reg:linear", n_estimators=10, seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
xg_reg.booster
DM_train = xgb.DMatrix(data=X_train,label=y_train)
DM_test = xgb.DMatrix(data=X_test,label=y_test)
params = {"booster":"gblinear","objective":"reg:linear"}
xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=10)
preds = xg_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
xg_reg.booster
For this problem the linear base learner actually does better; so for regression problems, a linear base learner is worth considering.
Two evaluation metrics are used below: RMSE and MAE.
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='rmse', as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='mae', as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))
Reference: DataCamp.
Regularization parameters in XGBoost (a sketch of where they go follows this list):
- gamma - minimum loss reduction allowed for a split to occur
- alpha - l1 regularization on leaf weights, larger values mean more regularization
- lambda - l2 regularization on leaf weights
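A hedged sketch showing where these three knobs sit, with arbitrary values and random toy data purely for illustration:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 3)
y = np.random.rand(100)
dtrain = xgb.DMatrix(data=X, label=y)

# All three regularizers are plain entries in the params dict
params = {
    "objective": "reg:squarederror",
    "gamma": 1,    # minimum loss reduction required to make a split
    "alpha": 10,   # L1 penalty on leaf weights
    "lambda": 1,   # L2 penalty on leaf weights
}
booster = xgb.train(params=params, dtrain=dtrain, num_boost_round=5)
```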
X, y = boston.data, boston.target
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
l1_params = [1, 10, 100]
# Create the initial parameter dictionary for varying l1 strength: params
params = {"objective":"reg:linear","max_depth":4} # by default, gbtree
# Create an empty list for storing rmses as a function of l1 complexity
rmses_l1 = []
# Iterate over reg_params
for reg in l1_params:
    # Update l1 strength
    params["alpha"] = reg
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse",
                             as_pandas=True, seed=123)
    # Append best rmse (final round) to rmses_l1
    rmses_l1.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])
pd.DataFrame(list(zip(l1_params, rmses_l1)), columns=["l1", "rmse"])
| | l1 | rmse |
| --- | --- | --- |
| 0 | 1 | 7.139736 |
| 1 | 10 | 7.950858 |
| 2 | 100 | 10.937085 |
| | num_boost_round | rmse |
| --- | --- | --- |
| 0 | 50 | 3.335649 |
| 1 | 60 | 3.336287 |
| 2 | 70 | 3.335520 |
| 3 | 80 | 3.336278 |
| 4 | 90 | 3.336160 |
| 5 | 100 | 3.336507 |
| 6 | 110 | 3.336662 |
| 7 | 120 | 3.336759 |
| 8 | 130 | 3.336809 |
| 9 | 140 | 3.336778 |
| 10 | 150 | 3.336770 |
| 11 | 160 | 3.336770 |
| 12 | 170 | 3.336770 |
| 13 | 180 | 3.336770 |
| 14 | 190 | 3.336770 |
Choose 70, the round with the lowest RMSE.
Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric (`"rmse"` in our case) does not improve for a given number of rounds. Here you will use the `early_stopping_rounds` parameter in `xgb.cv()` with a large possible number of boosting rounds (50).
`early_stopping_rounds` acts on the test folds: it watches whether the test-fold RMSE keeps improving.
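Outside of `xgb.cv`, the same mechanism is available in `xgb.train` with an explicit hold-out set passed via `evals`; a sketch with random toy data:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 3)
y = np.random.rand(200)
dtrain = xgb.DMatrix(X[:150], label=y[:150])
dvalid = xgb.DMatrix(X[150:], label=y[150:])  # hold-out set watched each round

booster = xgb.train(
    params={"objective": "reg:squarederror", "eval_metric": "rmse"},
    dtrain=dtrain,
    num_boost_round=50,
    evals=[(dvalid, "validation")],
    early_stopping_rounds=10,  # stop if validation rmse stalls for 10 rounds
    verbose_eval=False,
)
print(booster.best_iteration)  # round with the best validation rmse
```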
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree: params
params = {"objective": "reg:squarederror"}
# Create list of early_stopping_rounds values (plus None for no early stopping)
early_stopping_round_list = list(range(1, 20))
early_stopping_round_list.append(None)
# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []
# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_val in early_stopping_round_list:
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=curr_val,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the resultant DataFrame
early_stopping_round_rmses = list(zip(early_stopping_round_list, final_rmse_per_round))
pd.DataFrame(early_stopping_round_rmses, columns=["early_stopping_rounds", "rmse"])
The learning rate in XGBoost, `"eta"`, is a parameter that ranges between 0 and 1. After each boosting round it shrinks the contribution of the newly added tree, so smaller values make the boosting process more conservative, i.e. more strongly regularized, at the price of needing more rounds. $\eta \in [0,1]$ is the learning rate; it doubles as a regularization device, and it is the lower values, not the higher ones, that regularize harder.
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree (boosting round)
params = {"objective": "reg:squarederror"}
# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1, 0.2, 0.3]
best_rmse = []
# Systematically vary the eta
for curr_val in eta_vals:
    params["eta"] = curr_val
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the resultant DataFrame
pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta", "best_rmse"])
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary
params = {"objective": "reg:squarederror","eta": 0.2}
# Create list of max_depth values
max_depths = [2, 3, 5, 6, 10, 20]
best_rmse = []
# Systematically vary the max_depth
for curr_val in max_depths:
    params["max_depth"] = curr_val
    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the resultant DataFrame
pd.DataFrame(list(zip(max_depths, best_rmse)), columns=["max_depth", "best_rmse"])
Now, it's time to tune `colsample_bytree`. You've already seen this if you've ever worked with scikit-learn's `RandomForestClassifier` or `RandomForestRegressor`, where it is called `max_features`. In both `xgboost` and `sklearn`, this parameter (although named differently) simply specifies the fraction of features to choose from at every split in a given tree. In `xgboost`, `colsample_bytree` must be specified as a float between 0 and 1.
That is, the proportion of all features a tree may consider; in xgboost the columns are sampled once per tree.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary
params = {"objective": "reg:squarederror", "eta": 0.2, "max_depth": 6}
# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 0.9, 0.95, 1]
best_rmse = []
# Systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:
    params["colsample_bytree"] = curr_val
    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the resultant DataFrame
pd.DataFrame(
    list(zip(colsample_bytree_vals, best_rmse)),
    columns=["colsample_bytree", "best_rmse"],
)
`subsample` dictates the fraction of the training data that is used during any given boosting round.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary
params = {
    "objective": "reg:squarederror",
    "eta": 0.2,
    "max_depth": 6,
    "colsample_bytree": 1,
}
# Create list of hyperparameter values: subsample_vals
subsample_vals = [0.1, 0.5, 0.8, 0.9, 0.95, 1]
best_rmse = []
# Systematically vary the hyperparameter value
for curr_val in subsample_vals:
    params["subsample"] = curr_val
    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain=housing_dmatrix,
        params=params,
        early_stopping_rounds=9,
        num_boost_round=70,
        metrics="rmse",
        as_pandas=True,
        seed=123,
    )
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the resultant DataFrame
pd.DataFrame(
    list(zip(subsample_vals, best_rmse)),
    columns=["subsample", "best_rmse"],
)
So the best setting is:
params = {
    "objective": "reg:squarederror",
    "eta": 0.2,
    "max_depth": 6,
    "colsample_bytree": 1,
    "subsample": 1,
}
together with early_stopping_rounds=9 and num_boost_round=70.
The sweeps above found each hyperparameter's optimum with the other hyperparameters held at their defaults; now we consider several of them jointly and search the neighborhood of those values for a joint optimum.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
X, y = boston.data, boston.target
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    # NB: in the sklearn wrapper the number of trees is n_estimators;
    # num_boost_round and early_stopping_rounds here are passed through as extra kwargs
    "num_boost_round": [65, 70, 75],
    "early_stopping_rounds": [9, 10, 11],
    "eta": [0.15, 0.2, 0.25],
    "max_depth": [5, 6, 7],
    "colsample_bytree": [0.9, 0.95, 1],
    "subsample": [0.9, 0.95, 1],
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# Perform grid search: grid_mse
grid_mse = GridSearchCV(
    estimator=gbm,
    param_grid=gbm_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)
# Fit grid_mse to the data
grid_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
Here `verbose=1` mainly produces nicer progress output. The grid search itself, though, is very time-consuming.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
params = {
    "objective": "reg:squarederror",
    "eta": 0.15,
    "max_depth": 5,
    "colsample_bytree": 1,
    "subsample": 1,
}
best_rmse = []
# Perform cross-validation with the grid-search winners
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    early_stopping_rounds=9,
    num_boost_round=65,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)
# Append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the final-round RMSE
print(best_rmse)
X, y = boston.data, boston.target
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    "num_boost_round": [65, 70, 75],
    "early_stopping_rounds": [9, 10, 11],
    "eta": [0.15, 0.2, 0.25],
    "max_depth": [5, 6, 7],
    "colsample_bytree": [0.9, 0.95, 1],
    "subsample": [0.9, 0.95, 1],
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# Perform randomized search: randomized_mse
randomized_mse = RandomizedSearchCV(
    estimator=gbm,
    param_distributions=gbm_param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)
# Fit randomized_mse to the data
randomized_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
params = {
    "objective": "reg:squarederror",
    "eta": 0.15,
    "max_depth": 5,
    "colsample_bytree": 0.95,
    "subsample": 0.9,
}
best_rmse = []
# Perform cross-validation with the randomized-search winners
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    early_stopping_rounds=9,
    num_boost_round=75,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)
# Append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the final-round RMSE
print(best_rmse)