Target Encoding Study Notes
2020-02-24
- The notes are stitched together with RMarkdown's child parameter.
- Notes assembled this way are easier to review.
- Report related problems via Issue.
1 Target statistics
Target statistics (潘华引 2019) is a form of modularization; the same technique is also called impact coding (Zumel 2019) and target encoding (Babakhin 2019).
\[\text{Greedy TS}= \mathbb{E}\left(y | x^{i}=x_{k}^{i}\right)\]
\[\hat{x}_{k}^{i}=\frac{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}} \cdot y_{j}}{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}}}\]
This is simply impact coding without subtracting the global mean.
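As a minimal sketch of the greedy statistic, assuming a pandas data frame with one categorical column and a numeric target (the column names and values below are made up for illustration):

```python
import pandas as pd

# Toy data (invented): one categorical feature and a numeric target.
toy = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "y":    [1.0, 3.0, 0.0, 2.0, 4.0, 5.0],
})

# Greedy target statistic: each row gets the mean of y over all rows
# that share its level, including the row itself.
toy["city_ts"] = toy.groupby("city")["y"].transform("mean")
print(toy)
```

Level "c" occurs only once, so its code equals its own target value. That is the leakage the following regularization tries to address.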
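To handle overfitting (target leakage), a regularization term is added.
\[\hat{x}_{k}^{i}=\frac{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}} \cdot y_{j}+a p}{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}}+a}\]
Here \(a\) is the weight of the prior and \(p\) is the prior itself, which is usually set to the mean of the target variable in the dataset (潘华引 2019). This is therefore a weighted-rating style of treatment.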
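The smoothed statistic can be sketched as follows. The function name smoothed_target_stat, the column names and the toy values are assumptions made for illustration; the prior is taken to be the global mean of the target, as described above.

```python
import pandas as pd

def smoothed_target_stat(df, col, target, a=1.0):
    """Return (sum of y in the level + a * p) / (count of the level + a) per row."""
    p = df[target].mean()                  # prior: global mean of the target
    grouped = df.groupby(col)[target]
    level_sum = grouped.transform("sum")
    level_count = grouped.transform("count")
    return (level_sum + a * p) / (level_count + a)

# Toy data (invented).
toy = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "y":    [1.0, 3.0, 0.0, 2.0, 4.0, 5.0],
})
toy["city_ts"] = smoothed_target_stat(toy, "city", "y", a=1.0)
print(toy)
```

With a = 1 the single-row level "c" is pulled from 5.0 toward the global mean 2.5 and lands at 3.75, while levels with more rows move less.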
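However, this regularization-style treatment is known to have little effect (Zumel 2019), so cross-validation-type methods are needed; this is where Holdout TS comes in.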
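A minimal sketch of the holdout idea, assuming it means that one part of the training data is used only to compute the target statistics and a disjoint part is used only to train the model. The column names, the 50/50 split and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.choice(list("abcd"), size=40),   # made-up categorical feature
    "y": rng.normal(size=40),                 # made-up target
})

# ts_part is used only to learn the statistics, fit_part only to train the model.
ts_part, fit_part = train_test_split(toy, test_size=0.5, random_state=0)

prior = ts_part["y"].mean()
level_means = ts_part.groupby("x")["y"].mean()

# Levels that never occur in ts_part fall back to the prior.
fit_part = fit_part.assign(x_ts=fit_part["x"].map(level_means).fillna(prior))
print(fit_part.head())
```

The price of this scheme is that each row is used for only one purpose, which is why cross-validated variants are usually preferred when data is scarce.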
2 Main theory
The intuition is that smaller coefficients are less sensitive to idiosyncracies in the training data, and hence, less likely to overfit. (Zumel 2019)
The larger beta is, the more the model is influenced by the training data.
Cross-validation is a way to safely reuse training data in nested model situations. (Zumel 2019)
Advantages of cross-validation:
- safety is the key point
- reuse is the key point
In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias. (Zumel 2019)
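Learning the hyperparameters and training the model must be kept separate. This is how pre-trained models work. In other words, building the target encoding and fitting a model on the encoded features are two distinct stages.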
Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation? (Zumel 2019) The short answer: no, you shouldn’t. But as, we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.” (Zumel 2019)
CV is more expensive to compute.
One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models. (Zumel 2019)
This is the target encoding approach.
It is also how impact coding works.
specifically for dealing with high-cardinality categorical variables (Zumel 2019)
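Once the levels are given an order (a numeric value), the variable can be handled this way instead of as an unordered categorical variable.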
Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one variable base learners and to fit the larger model, you have overfit. (Zumel 2019)
This shows that when target encoding is misused, the aggregated variable looks highly significant; that is overfitting, and it is misleading.
The correct way to impact code (or to nest models in general) is to use cross-validation techniques. (Zumel 2019)
Here CV is proposed as the way to do target encoding.
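A minimal sketch of cross-validated (out-of-fold) impact coding, assuming scikit-learn's KFold is used for the splits. The function name oof_target_encode is hypothetical, and this is not a description of how any particular library implements it: the code for each row is learned only from the folds that do not contain that row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Encode df[col] with impact codes estimated on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        grand_mean = fit_part[target].mean()
        # impact code: level mean minus grand mean, both from the fitting folds
        level_codes = fit_part.groupby(col)[target].mean() - grand_mean
        # levels unseen in the fitting folds get 0, i.e. no information
        codes = df.iloc[enc_idx][col].map(level_codes).fillna(0.0)
        encoded.iloc[enc_idx] = codes.to_numpy()
    return encoded

# Toy usage with invented data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.choice(list("abc"), size=60),
                    "y": rng.normal(size=60)})
toy["x_code"] = oof_target_encode(toy, "x", "y")
print(toy.head())
```

Because a row's own target never enters its code, changing that target leaves the row's encoded value unchanged, which is the property the naive encoding lacks.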
We can compare the performance of this model to the naive model on holdout data. (Zumel 2019)
With this approach the random (noise) categorical variable is no longer significant, and R^2 on the test set also improves.
But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace smoothed value minus the grand mean. (Zumel 2019)
\[v_{i}=\Sigma_{x_{j}=s_{i}} y_{i} /\left(\operatorname{count}_{i}+\lambda\right)-E\left[y_{i}\right]\]
Regularization here is just Laplace smoothing.
http://www.win-vector.com/blog/2019/11/when-cross-validation-is-more-powerful-than-regularization/
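The claim can be checked with a short derivation. This is a sketch under the assumption that the one-variable model assigns one free value \(v_i\) to each level \(s_i\), has no intercept, and is fit by least squares with an L2 penalty \(\lambda\); it is not copied from the appendix of the cited article. Writing \(v_{x_j}\) for the value of the level that row \(j\) belongs to:

\[L(v)=\sum_{j}\left(y_{j}-v_{x_{j}}\right)^{2}+\lambda \sum_{i} v_{i}^{2}\]

\[\frac{\partial L}{\partial v_{i}}=-2 \sum_{x_{j}=s_{i}}\left(y_{j}-v_{i}\right)+2 \lambda v_{i}=0 \quad \Rightarrow \quad v_{i}=\frac{\sum_{x_{j}=s_{i}} y_{j}}{\operatorname{count}_{i}+\lambda}\]

With \(\lambda=0\) this is the plain level mean; a positive \(\lambda\) behaves like \(\lambda\) extra pseudo-observations with value 0. The formula above then subtracts the grand mean from this smoothed value.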
Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.(Zumel 2019)
This justifies the correct way to do both target encoding and stacking.
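3 Implementation code
See GitHub.
We find that handling target encoding incorrectly easily leads to overfitting. The usual remedies for overfitting are:
- cross-validation (this is the correct way to do target encoding)
- regularization
We will try both methods to find the right way to deal with the overfitting.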
3.1 Generating the data
set.seed(3453421)
Ndata <- 500
nnoise <- 100
nsig <- 10
noise_levels <- create_level(nnoise)
signal_levels <- create_level(nsig)
sig_amps <-
runif(1:nsig, min=-1, max=1) %>%
multiply_by(2) %>%
`names<-`(signal_levels) %>%
{.-mean(.)}
x_s <- sample(signal_levels, Ndata, replace=TRUE)
y <- sig_amps[x_s] + rnorm(Ndata)
# the names on sig_amps let it be indexed by level to fetch each level's value
# y is a function of x_s but not of x_n; x_n is random, so it should have no real effect on y
x_n <- sample(noise_levels, Ndata, replace=TRUE)
df <- data.frame(x_s=x_s, x_n=x_n, y=y, stringsAsFactors=FALSE)
library(zeallot)
c(dtest, dtrain) %<-% split(df, runif(Ndata) < 0.5) # false comes first
dtrain %>% head
##
## n_01 n_02 n_03 n_04 n_05 n_06 n_07 n_08 n_09 n_10
## 22 27 34 18 26 23 28 22 24 22
##
## n_01 n_02 n_04 n_05 n_06 n_07 n_08 n_09 n_10 n_100 n_11 n_12 n_13
## 3 3 2 1 2 3 2 3 1 1 3 2 1
## n_14 n_15 n_16 n_17 n_18 n_19 n_20 n_21 n_22 n_23 n_24 n_25 n_26
## 4 2 4 1 2 2 5 1 3 3 6 4 7
## n_27 n_29 n_30 n_31 n_33 n_34 n_35 n_36 n_37 n_38 n_39 n_40 n_41
## 3 2 4 3 3 3 4 5 3 2 3 2 3
## n_42 n_44 n_45 n_46 n_47 n_48 n_50 n_51 n_52 n_53 n_54 n_55 n_56
## 6 4 1 3 1 2 1 1 3 4 3 3 5
## n_57 n_58 n_59 n_61 n_62 n_63 n_65 n_66 n_67 n_68 n_71 n_72 n_73
## 2 1 3 4 3 3 4 6 2 4 3 1 4
## n_74 n_75 n_76 n_77 n_78 n_79 n_80 n_81 n_82 n_83 n_84 n_85 n_86
## 3 3 4 3 2 2 3 3 2 1 2 1 2
## n_87 n_88 n_89 n_90 n_91 n_92 n_93 n_94 n_95 n_98 n_99
## 2 4 2 1 3 1 2 3 3 2 4
## [1] "我" "他" "我" "你" "他" "我" "他" "他" "他" "他"
3.2 Using the training set directly
y - mean(y) is the target encoding treatment.
\[v_i = E[y | x = s_i] - E[y]\]
\(s_i\) denotes the \(i\)-th level.
library(rqdatatable)
library(wrapr)
library(magrittr)
xs_means <- get_level_mean(dtrain, x_s)
xn_means <- get_level_mean(dtrain, x_n)
xs_means %>% head
meany is the mean of y within each level; coeff is that mean with the global mean of y subtracted.
# the maps that convert categorical levels to numerical values
xs_map <- with(xs_means, x_s := coeff)
xn_map <- with(xn_means, x_n := coeff)
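prepare_manually <- function(coefmap, xcol) {
  treated = coefmap[xcol]
  # levels not present in the map get 0 (no information)
  ifelse(is.na(treated), 0, treated)
}
# "prepare" the data
dtrain_treated <-
  dtrain %>%
  mutate(
    vs = prepare_manually(xs_map, dtrain$x_s),
    vn = prepare_manually(xn_map, dtrain$x_n)
  )
head(dtrain_treated)
# fit the naive model on the same rows that were used to build the codes
model_raw <- lm(y ~ vs + vn, data = dtrain_treated)
summary(model_raw)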
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9782 -0.5588 0.0022 0.5574 2.4658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01023 0.05642 -0.181 0.856
## vs 0.78055 0.06132 12.730 < 2e-16 ***
## vn 0.61687 0.07202 8.565 1.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8844 on 243 degrees of freedom
## Multiple R-squared: 0.6195, Adjusted R-squared: 0.6163
## F-statistic: 197.8 on 2 and 243 DF, p-value: < 2.2e-16
dtest_treated <-
dtest %>%
mutate(
vs = prepare_manually(xs_map, dtest$x_s),
vn = prepare_manually(xn_map, dtest$x_n)
)
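# frame holding the true outcome plus each model's holdout predictions
pframe <- data.frame(y = dtest$y)
pframe$ypred_naive = predict(model_raw, newdata=dtest_treated)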
# look at the predictions on holdout data
compare_models(pframe) %>%
mutate(rsq_train = model_raw %>%
broom::glance() %>%
.$r.squared)
Comparing R^2 on the train and test sets reveals the overfitting.
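The same effect can be reproduced in a few lines of Python. This is a rough re-creation under assumptions, not a port of the R code above: the random seed, the 50/50 split by row order, and the helper names impact_map, design and r_squared are made up for illustration, so the numbers will not match the R output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, n_sig, n_noise = 500, 10, 100

# y depends on x_s only; x_n is pure noise with many levels
amps = rng.uniform(-2, 2, n_sig)
amps -= amps.mean()
x_s = rng.integers(0, n_sig, n)
x_n = rng.integers(0, n_noise, n)
df = pd.DataFrame({"x_s": x_s, "x_n": x_n, "y": amps[x_s] + rng.normal(size=n)})
train, test = df.iloc[: n // 2].copy(), df.iloc[n // 2:].copy()

def impact_map(d, col):
    """Level mean of y minus the grand mean, learned from d."""
    return d.groupby(col)["y"].mean() - d["y"].mean()

def r_squared(y, pred):
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# naive: the codes are learned on the same rows the model is fit on
maps = {c: impact_map(train, c) for c in ["x_s", "x_n"]}

def design(d):
    """Intercept column plus the two impact-coded columns (unseen levels get 0)."""
    cols = [d[c].map(maps[c]).fillna(0.0).to_numpy() for c in ["x_s", "x_n"]]
    return np.column_stack([np.ones(len(d))] + cols)

beta, *_ = np.linalg.lstsq(design(train), train["y"].to_numpy(), rcond=None)
r2_train = r_squared(train["y"].to_numpy(), design(train) @ beta)
r2_test = r_squared(test["y"].to_numpy(), design(test) @ beta)
print(f"naive coding: R2 train = {r2_train:.3f}, R2 test = {r2_test:.3f}")
```

The gap between the two printed values is the nested model bias: the noise column has only a few rows per level, so its in-sample codes memorize part of the training noise.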
3.3 Cross-validation
library(vtreat)
library(wrapr)
xframeResults <- mkCrossFrameNExperiment(dtrain,
qc(x_s, x_n),
"y",
codeRestriction = qc(catN),
verbose = FALSE)
# the plan uses the one-variable models to treat data
treatmentPlan <- xframeResults$treatments
# the cross-frame
dtrain_treated <- xframeResults$crossFrame
# this function performs the target encoding under CV in one call
head(dtrain_treated)
variables <- setdiff(colnames(dtrain_treated), "y")
model_X <- lm(mk_formula("y", variables),
data = dtrain_treated)
summary(model_X)
##
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.85533 -0.70304 -0.04544 0.69367 2.98327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.013627 0.066773 -0.204 0.838
## x_s_catN 0.963333 0.065729 14.656 <2e-16 ***
## x_n_catN 0.006362 0.062030 0.103 0.918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.044 on 243 degrees of freedom
## Multiple R-squared: 0.4695, Adjusted R-squared: 0.4651
## F-statistic: 107.5 on 2 and 243 DF, p-value: < 2.2e-16
\(R_{\text{train}}^2\) has dropped, so it is no longer inflated relative to \(R_{\text{test}}^2\); overfitting is reduced.
What we want is a higher \(R_{\text{test}}^2\).
dtest_treated <- prepare(treatmentPlan, dtest)
pframe$ypred_crossval = predict(model_X, newdata=dtest_treated)
compare_models(pframe)
A higher test-set R^2 with no obvious overfitting shows that CV is the better approach.
3.4 Regularization
For a one-variable model, L2-regularization is simply Laplace smoothing.
L2 regularization amounts to Laplace smoothing.
\[v_i = \sum\nolimits_{x_j = s_i} y_i/(\text{count}_i + \lambda) - E[y_i]\]
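The smoothing part of this formula (before the grand mean is subtracted) can be checked numerically. The sketch below assumes a ridge regression on one-hot indicator columns with no intercept, solved in closed form with NumPy; the level indices and target values are invented.

```python
import numpy as np

levels = np.array([0, 0, 1, 1, 1, 2])          # made-up level indices
y = np.array([1.0, 3.0, 0.0, 2.0, 4.0, 5.0])   # made-up target
lam = 1.0

# Ridge solution (X'X + lam * I)^-1 X'y on one-hot indicators, no intercept
X = np.eye(3)[levels]
ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Laplace smoothing: per-level sum of y divided by (count + lam)
sums = np.bincount(levels, weights=y)
counts = np.bincount(levels)
laplace = sums / (counts + lam)

print(ridge)
print(laplace)
```

Both vectors come out as 4/3, 1.5 and 2.5, because X'X is a diagonal matrix of level counts for one-hot columns.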
# build the coefficients
lambda <- 1
xs_regmap <- dtrain %.>%
extend(., grandmean = mean(y)) %.>%
project(.,
sum_y := sum(y),
count_y := n(),
grandmean := mean(grandmean), # pseudo-aggregator
groupby = 'x_s') %.>%
extend(.,
vs := (sum_y/(count_y + lambda)) - grandmean
) %.>%
order_rows(.,
'x_s') %.>%
as.data.frame(.)
xs_regmap
xn_regmap <- dtrain %.>%
extend(., grandmean = mean(y)) %.>%
project(.,
sum_y := sum(y),
count_y := n(),
grandmean := mean(grandmean), # pseudo-aggregator
groupby = 'x_n') %.>%
extend(.,
vn := (sum_y/(count_y + lambda)) - grandmean
) %.>%
order_rows(.,
'x_n') %.>%
as.data.frame(.)
# the maps that convert categorical levels to numerical values
vs_map <- xs_regmap$x_s := xs_regmap$vs
vn_map <- xn_regmap$x_n := xn_regmap$vn
# "prepare" the data
dtrain_treated <- dtrain
dtrain_treated$vs = prepare_manually(vs_map, dtrain$x_s)
dtrain_treated$vn = prepare_manually(vn_map, dtrain$x_n)
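head(dtrain_treated)
# fit the model on the regularized codes
model_reg <- lm(y ~ vs + vn, data = dtrain_treated)
summary(model_reg)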
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0167 -0.5639 0.0159 0.5472 2.4304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.02210 0.05658 -0.391 0.696
## vs 0.81520 0.06389 12.760 < 2e-16 ***
## vn 0.80144 0.09495 8.441 2.86e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8873 on 243 degrees of freedom
## Multiple R-squared: 0.6169, Adjusted R-squared: 0.6138
## F-statistic: 195.7 on 2 and 243 DF, p-value: < 2.2e-16
# apply to test data
dtest_treated <- dtest
dtest_treated$vs = prepare_manually(vs_map, dtest$x_s)
dtest_treated$vn = prepare_manually(vn_map, dtest$x_n)
pframe$ypred_reg = predict(model_reg, newdata=dtest_treated)
# compare the predictions of each model
compare_models(pframe)
L2 regularization turns out to have very little effect.
3.5 Regularization 2.0
# build the coefficients
lambda <- 100
xs_regmap <- dtrain %.>%
extend(., grandmean = mean(y)) %.>%
project(.,
sum_y := sum(y),
count_y := n(),
grandmean := mean(grandmean), # pseudo-aggregator
groupby = 'x_s') %.>%
extend(.,
vs := (sum_y/(count_y + lambda)) - grandmean
) %.>%
order_rows(.,
'x_s') %.>%
as.data.frame(.)
xs_regmap
xn_regmap <- dtrain %.>%
extend(., grandmean = mean(y)) %.>%
project(.,
sum_y := sum(y),
count_y := n(),
grandmean := mean(grandmean), # pseudo-aggregator
groupby = 'x_n') %.>%
extend(.,
vn := (sum_y/(count_y + lambda)) - grandmean
) %.>%
order_rows(.,
'x_n') %.>%
as.data.frame(.)
# the maps that convert categorical levels to numerical values
vs_map <- xs_regmap$x_s := xs_regmap$vs
vn_map <- xn_regmap$x_n := xn_regmap$vn
# "prepare" the data
dtrain_treated <- dtrain
dtrain_treated$vs = prepare_manually(vs_map, dtrain$x_s)
dtrain_treated$vn = prepare_manually(vn_map, dtrain$x_n)
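head(dtrain_treated)
# fit the model on the more strongly regularized codes
model_reg <- lm(y ~ vs + vn, data = dtrain_treated)
summary(model_reg)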
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01570 -0.57022 -0.00065 0.57939 2.49627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.40213 0.07306 -5.504 9.40e-08 ***
## vs 3.98571 0.30395 13.113 < 2e-16 ***
## vn 14.84943 1.95003 7.615 5.84e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.909 on 243 degrees of freedom
## Multiple R-squared: 0.598, Adjusted R-squared: 0.5947
## F-statistic: 180.7 on 2 and 243 DF, p-value: < 2.2e-16
# apply to test data
dtest_treated <- dtest
dtest_treated$vs = prepare_manually(vs_map, dtest$x_s)
dtest_treated$vn = prepare_manually(vn_map, dtest$x_n)
pframe$ypred_reg = predict(model_reg, newdata=dtest_treated)
# compare the predictions of each model
compare_models(pframe)
With stronger L2 regularization the result is somewhat better, but it is still not as good as CV.
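One way to see why a larger lambda helps so little: sum_y / (count + lambda) equals the raw level mean times count / (count + lambda), so smoothing mostly rescales each code. The helper name shrink_factor and the example counts below are illustrative assumptions.

```python
def shrink_factor(count, lam):
    """Factor by which Laplace smoothing scales a level's raw mean."""
    return count / (count + lam)

for lam in (1, 100):
    for count in (3, 25):
        factor = shrink_factor(count, lam)
        print(f"lambda={lam:>3}, count={count:>2}: factor={factor:.3f}")
```

A level with about 25 rows is scaled by roughly 0.96 at lambda = 1 and by 0.2 at lambda = 100. The downstream linear model can undo much of that by choosing a larger coefficient, which is consistent with the vs coefficient rising from about 0.82 to about 3.99 in the two summaries above.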
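3.6 Summary
Using target encoding incorrectly causes overfitting, and regularization turns out to help very little, so just use CV.
This can be understood through stacking. Stacking requires that the same training sample not be reused; this means it must not be reused across the two levels, while within a single level it may be used repeatedly.
So, in principle, the values produced by the target encoding and the predictions of the final model that uses the encoded variables should not come from the same sample. After the CV treatment, the sample used to build the target encoding and the sample used to train the final model no longer overlap so heavily.
This in turn gives a better understanding of stacking.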
- Split train data into two parts
- Train multiple models on Part 1
- Make predictions on Part 2
- Make predictions on the test data
- Train a new model on Part 2 using predictions as features
- Make predictions on the test data using the 2nd level model
- data => train + hold-out
- train => p1 + p2
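- p1 trains the 1st-level models, which then predict on p2 and on the hold-out set; the p2 predictions are used for training, and the hold-out predictions are the inputs when the 2nd-level model predicts
- p2 trains the 2nd-level model, also called the meta-model, which predicts on the hold-out set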
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
# Split train data into two parts
part_1, part_2 = train_test_split(train, test_size=0.5, random_state=123)
# Train a Gradient Boosting model on Part 1
gb = GradientBoostingRegressor().fit(part_1[features], part_1.fare_amount)
# Train a Random Forest model on Part 1
rf = RandomForestRegressor().fit(part_1[features], part_1.fare_amount)
# Make predictions on the Part 2 data
part_2['gb_pred'] = gb.predict(part_2[features])
part_2['rf_pred'] = rf.predict(part_2[features])
# Make predictions on the test data
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])
from sklearn.linear_model import LinearRegression
# Create linear regression model without the intercept
lr = LinearRegression(fit_intercept=False)
# Train 2nd level model in the part_2 data
lr.fit(part_2[['gb_pred', 'rf_pred']], part_2.fare_amount)
# Make stacking predictions on the test data
test['stacking'] = lr.predict(test[['gb_pred', 'rf_pred']])
# Look at the model coefficients
print(lr.coef_)
Usually, the 2nd level model is some simple model like Linear or Logistic Regressions. Also, note that you were not using intercept in the Linear Regression just to combine pure model predictions. Looking at the coefficients, it’s clear that 2nd level model has more trust in the Gradient Boosting: 0.7 versus 0.3 for the Random Forest model. (Babakhin 2019)
This is a common practice in industry: do not use an intercept.
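The snippet above depends on train, test and features objects that are not defined in these notes. A self-contained sketch of the same p1 / p2 / hold-out scheme, assuming synthetic data from scikit-learn's make_regression and small model sizes chosen only to keep it fast, could look like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# data => train + hold-out, train => p1 + p2
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_p1, X_p2, y_p1, y_p2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0)

# 1st level: fitted on p1 only
base_models = [
    GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_p1, y_p1),
    RandomForestRegressor(n_estimators=50, random_state=0).fit(X_p1, y_p1),
]

def level1_features(X_part):
    """Stack the base-model predictions as columns."""
    return np.column_stack([m.predict(X_part) for m in base_models])

# 2nd level (meta-model): fitted on p2 predictions, no intercept
meta = LinearRegression(fit_intercept=False).fit(level1_features(X_p2), y_p2)

# final predictions on the hold-out set
stack_pred = meta.predict(level1_features(X_hold))
print(meta.coef_)
```

The meta-model never sees rows that the base models were trained on, which mirrors the rule for target encoding: the y-aware stage and the stage that consumes its output use different rows.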
3.7 Related functions
## ../R
## +-- create_level.R
## +-- evaluate_model.R
## +-- get_level_mean.R
## +-- load.R
## \-- path_md_fence.R
$../R/create_level.R
$../R/evaluate_model.R
# function to calculate the rmse
rmse <- function(ypred, y) {
resid = y - ypred
sqrt(mean(resid ^ 2))
}
# function to calculate R-squared
rsquared <- function(ypred, y) {
null_variance = sum((y - mean(y)) ^ 2)
resid_variance = sum((y - ypred) ^ 2)
1 - (resid_variance / null_variance)
}
compare_models <- function(predframe) {
predictions = setdiff(colnames(predframe), "y")
data.frame(
# pred_type = predictions,
rmse = vapply(
predframe[, predictions, drop = FALSE],
FUN = function(p)
rmse(p, predframe$y),
numeric(1)
),
rsq_test = vapply(
predframe[, predictions, drop = FALSE],
FUN = function(p)
rsquared(p, predframe$y),
numeric(1)
)
)
}
$../R/get_level_mean.R
get_level_mean <- function(dtrain, col) {
xs_means <- dtrain %>%
mutate(delta = y - mean(y)) %>%
group_by({{col}}) %>%
summarise(meany = mean(y),
coeff = mean(delta)) %>%
arrange({{col}})
return(xs_means)
}
$../R/load.R
# library -----------------------------------------------------------------
library(tidyverse)
library(fs)
library(glue)
library(rio)
library(here)
## rmarkdown -------------------------------------------------------------
library(knitr)
library(kableExtra)
library(markdown)
# source custom func ------------------------------------------------------
# here("R") %>% dir_info() %>%
# filter(!path %>% str_detect('load.R')) %>%
# mutate(
# source = map(path,source)
# )
$../R/path_md_fence.R
4 Python implementation example
(175008, 35)
boxscore_cols = [
'T1_FGM', 'T1_FGA', 'T1_FGM3', 'T1_FGA3', 'T1_OR', 'T1_Ast', 'T1_TO', 'T1_Stl', 'T1_PF',
'T2_FGM', 'T2_FGA', 'T2_FGM3', 'T2_FGA3', 'T2_OR', 'T2_Ast', 'T2_TO', 'T2_Stl', 'T2_Blk',
'PointDiff']
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=123)
import numpy as np
import pandas as pd
cv_test_list = []
funcs = [np.mean]
fold = 0
for train_index, test_index in kf.split(regular_data):
    cv_train, cv_test = regular_data.iloc[train_index], regular_data.iloc[test_index]
    season_statistics = cv_train.groupby(["Season", 'T1_TeamID'])[boxscore_cols].agg(funcs).reset_index()
    season_statistics.columns = [''.join(col).strip() for col in season_statistics.columns.values]
    season_statistics_T1 = season_statistics.copy()
    season_statistics_T2 = season_statistics.copy()
    season_statistics_T1.columns = ["T1_" + x.replace("T1_","").replace("T2_","opponent_") for x in list(season_statistics_T1.columns)]
    season_statistics_T2.columns = ["T2_" + x.replace("T1_","").replace("T2_","opponent_") for x in list(season_statistics_T2.columns)]
    season_statistics_T1.columns.values[0] = "Season"
    season_statistics_T2.columns.values[0] = "Season"
    cv_test = pd.merge(cv_test, season_statistics_T1, on = ['Season', 'T1_TeamID'], how = 'left')
    cv_test = pd.merge(cv_test, season_statistics_T2, on = ['Season', 'T2_TeamID'], how = 'left')
    cv_test_list.append(cv_test)
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    fold += 1
Fold: 0
CV train shape: (140006, 35)
Fold: 1
CV train shape: (140006, 35)
Fold: 2
CV train shape: (140006, 35)
Fold: 3
CV train shape: (140007, 35)
Fold: 4
CV train shape: (140007, 35)
5
pandas.core.frame.DataFrame
# 0
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
# 1
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
# 2
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
# 3
<class 'pandas.core.frame.DataFrame'>
(35001, 73)
# 4
<class 'pandas.core.frame.DataFrame'>
(35001, 73)
| | Season | DayNum | T1_TeamID | T1_Score | T2_TeamID | T2_Score | location | NumOT | T1_FGM | T1_FGA | … | T2_opponent_FGMmean | T2_opponent_FGAmean | T2_opponent_FGM3mean | T2_opponent_FGA3mean | T2_opponent_ORmean | T2_opponent_Astmean | T2_opponent_TOmean | T2_opponent_Stlmean | T2_opponent_Blkmean | T2_PointDiffmean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2003 | 13 | 1166 | 106 | 1426 | 50 | 1 | 0 | 41 | 69 | … | 23.708333 | 54.708333 | 5.708333 | 17.458333 | 10.583333 | 11.416667 | 14.041667 | 7.625000 | 3.125000 | -0.583333 |
1 | 2003 | 14 | 1353 | 60 | 1162 | 36 | 1 | 0 | 23 | 57 | … | 21.894737 | 50.842105 | 6.789474 | 18.315789 | 10.315789 | 12.947368 | 12.578947 | 9.263158 | 3.947368 | -15.368421 |
2 | 2003 | 14 | 1390 | 61 | 1131 | 57 | 1 | 0 | 20 | 53 | … | 20.880000 | 51.680000 | 4.880000 | 16.600000 | 10.640000 | 11.920000 | 13.240000 | 6.840000 | 3.120000 | 5.840000 |
3 | 2003 | 14 | 1426 | 59 | 1106 | 47 | 0 | 0 | 25 | 53 | … | 22.043478 | 52.608696 | 4.869565 | 15.652174 | 11.130435 | 12.434783 | 14.782609 | 8.521739 | 3.043478 | -1.130435 |
4 | 2003 | 18 | 1113 | 59 | 1287 | 56 | 1 | 0 | 22 | 54 | … | 28.157895 | 58.894737 | 7.052632 | 19.631579 | 12.000000 | 16.263158 | 16.052632 | 5.789474 | 3.473684 | 2.578947 |
5 rows × 73 columns
(175008, 73)
Appendix
References
Babakhin, Yauhen. 2019. “Winning a Kaggle Competition in Python.” DataCamp. 2019. https://www.datacamp.com/courses/winning-a-kaggle-competition-in-python.
Zumel, Nina. 2019. “When Cross-Validation Is More Powerful Than Regularization.” Win-Vector Blog. 2019. http://www.win-vector.com/blog/2019/11/when-cross-validation-is-more-powerful-than-regularization/.
潘华引. 2019. “一文详尽系列之CatBoost.” Datawhale. 2019. https://mp.weixin.qq.com/s/E3pSPsG18053F5GG1Z8jNQ.