1. Use the child parameter of RMarkdown to stitch documents together.
  2. Notes stitched this way are easier to review.
  3. Submit related questions as Issues.

1 Target statistics

Target statistics (潘华引 2019), impact coding (Zumel 2019), and target encoding (Babakhin 2019) are the same technique; it is also a modular (nested-model) way of handling categorical variables.

\[\text{Greedy TS}= \mathbb{E}\left(y | x^{i}=x_{k}^{i}\right)\]

\[\hat{x}_{k}^{i}=\frac{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}} \cdot y_{j}}{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}}}\]

That is, it is impact coding without subtracting the global mean.

To deal with overfitting (target leakage), a regularization term is added:

\[\hat{x}_{k}^{i}=\frac{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}} \cdot y_{j}+a p}{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}}+a}\]

Here \(a\) is the weight given to the prior \(p\); the prior is usually set to the mean of the target variable over the whole dataset (潘华引 2019). The encoding is thus a weighted average of the level mean and the prior.
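
A minimal R sketch of this smoothed statistic; the column names x and y and the weight a = 20 are illustrative assumptions, not taken from the cited sources.

    # Smoothed target statistic for a single categorical column.
    # a is the weight of the prior p; p is taken as the global mean of y.
    smoothed_ts <- function(x, y, a = 20) {
      p <- mean(y)                          # prior: global target mean
      level_sum   <- tapply(y, x, sum)      # sum of y within each level
      level_count <- tapply(y, x, length)   # number of rows in each level
      ts <- (level_sum + a * p) / (level_count + a)
      ts[as.character(x)]                   # map every row to its level's statistic
    }

    # Toy usage on simulated data.
    set.seed(1)
    x <- sample(letters[1:5], 100, replace = TRUE)
    y <- rnorm(100) + as.integer(factor(x))
    head(smoothed_ts(x, y))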

However, this smoothing-style regularization turns out to help very little (Zumel 2019), so cross-validation-based methods are needed; this is what motivates the Holdout TS variant.

2 Main theory

The intuition is that smaller coefficients are less sensitive to idiosyncracies in the training data, and hence, less likely to overfit. (Zumel 2019)

The larger the coefficients (beta), the more the fit is driven by idiosyncrasies of the training data.

Cross-validation is a way to safely reuse training data in nested model situations. (Zumel 2019)

The key points about cross-validation:

  1. it is safe (it avoids leaking the outcome)
  2. it lets you reuse the training data

In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias. (Zumel 2019)

Hyperparameter learning and model training should use separate data. This is also how pre-trained models operate: fitting the target encoding and then running a model on the pre-computed encoding are two separate stages.

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation? (Zumel 2019) The short answer: no, you shouldn’t. But as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.” (Zumel 2019)

Cross-validation is the more computationally expensive of the two.

One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models. (Zumel 2019)

This is the target encoding approach, also known as impact coding.

specifically for dealing with high-cardinality categorical variables (Zumel 2019)

This turns an unordered, high-cardinality categorical variable into a single ordered numeric variable.

Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one variable base learners and to fit the larger model, you have overfit. (Zumel 2019)

This shows that careless target encoding makes the encoded variable look highly significant even when the underlying variable is pure noise; that is overfitting, and it is misleading.

The correct way to impact code (or to nest models in general) is to use cross-validation techniques. (Zumel 2019)

This is where cross-validation is brought in to do the target encoding.

We can compare the performance of this model to the naive model on holdout data. (Zumel 2019)

With cross-validation, the encoded noise variable is no longer significant, and R^2 on the test set actually improves.

But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace smoothed value minus the grand mean. (Zumel 2019)

\[v_{i}=\Sigma_{x_{j}=s_{i}} y_{i} /\left(\operatorname{count}_{i}+\lambda\right)-E\left[y_{i}\right]\]

For a one-variable model, this regularization is exactly Laplace smoothing.

http://www.win-vector.com/blog/2019/11/when-cross-validation-is-more-powerful-than-regularization/

Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.(Zumel 2019)

This establishes the correct way to do target encoding and, more generally, stacking.

3 Implementation code

See GitHub for the full code.

We found that an incorrect target encoding procedure easily leads to overfitting. The usual ways to handle this overfitting are:

  1. cross-validation (this is the correct way to do target encoding)
  2. regularization

We will try both in order to arrive at the correct way of handling the overfitting.

3.1 Simulating the data

## 
## n_01 n_02 n_03 n_04 n_05 n_06 n_07 n_08 n_09 n_10 
##   22   27   34   18   26   23   28   22   24   22
## 
##  n_01  n_02  n_04  n_05  n_06  n_07  n_08  n_09  n_10 n_100  n_11  n_12  n_13 
##     3     3     2     1     2     3     2     3     1     1     3     2     1 
##  n_14  n_15  n_16  n_17  n_18  n_19  n_20  n_21  n_22  n_23  n_24  n_25  n_26 
##     4     2     4     1     2     2     5     1     3     3     6     4     7 
##  n_27  n_29  n_30  n_31  n_33  n_34  n_35  n_36  n_37  n_38  n_39  n_40  n_41 
##     3     2     4     3     3     3     4     5     3     2     3     2     3 
##  n_42  n_44  n_45  n_46  n_47  n_48  n_50  n_51  n_52  n_53  n_54  n_55  n_56 
##     6     4     1     3     1     2     1     1     3     4     3     3     5 
##  n_57  n_58  n_59  n_61  n_62  n_63  n_65  n_66  n_67  n_68  n_71  n_72  n_73 
##     2     1     3     4     3     3     4     6     2     4     3     1     4 
##  n_74  n_75  n_76  n_77  n_78  n_79  n_80  n_81  n_82  n_83  n_84  n_85  n_86 
##     3     3     4     3     2     2     3     3     2     1     2     1     2 
##  n_87  n_88  n_89  n_90  n_91  n_92  n_93  n_94  n_95  n_98  n_99 
##     2     4     2     1     3     1     2     3     3     2     4
##  [1] "我" "他" "我" "你" "他" "我" "他" "他" "他" "他"
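
The chunk that generated the output above is not echoed in the note. Below is a minimal sketch of a comparable simulation, loosely following the setup in Zumel (2019): a signal variable x_s with 10 levels that drive y, and a noise variable x_n with 100 levels that do not. All names and sizes are assumptions.

    set.seed(2019)
    n <- 246                        # roughly the size implied by the lm output (243 residual df + 3 coefficients)
    # Signal variable: 10 levels, each with its own contribution to y.
    levels_s <- sprintf("s_%02d", 1:10)
    effect_s <- setNames(rnorm(10), levels_s)
    x_s <- sample(levels_s, n, replace = TRUE)
    # Noise variable: 100 levels with no effect on y.
    levels_n <- sprintf("n_%02d", 1:100)
    x_n <- sample(levels_n, n, replace = TRUE)
    y <- as.numeric(effect_s[x_s] + rnorm(n))
    dtrain <- data.frame(x_s = x_s, x_n = x_n, y = y, stringsAsFactors = FALSE)
    table(dtrain$x_n)               # level counts, analogous to the tables printed above
    # A test set for the train/test comparisons below would be simulated the same way.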

3.2 Using the training set directly

The target encoding used here is y - mean(y): each level is encoded as its mean of y minus the grand mean.

\[v_i = E[y | x = s_i] - E[y]\]

\(s_i\) denotes the \(i\)-th level.

meany is the mean of y within each level, and coeff is that mean minus the global mean of y.
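
The chunk itself is not echoed; a sketch of this naive encoding, fit and applied on the same training data (vs, vn, and dtrain_treated match the names in the summary below; the impact_code helper and the dtrain from the simulation sketch above are assumptions):

    # Impact-code one categorical column: level mean of y minus the grand mean,
    # estimated and applied on the very same data (this is the mistake).
    impact_code <- function(x, y) {
      coeff <- tapply(y, x, mean) - mean(y)    # meany per level minus the global mean
      as.numeric(coeff[as.character(x)])
    }

    dtrain_treated <- data.frame(
      y  = dtrain$y,
      vs = impact_code(dtrain$x_s, dtrain$y),  # encoded signal variable
      vn = impact_code(dtrain$x_n, dtrain$y)   # encoded noise variable
    )
    summary(lm(y ~ vs + vn, data = dtrain_treated))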

## 
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9782 -0.5588  0.0022  0.5574  2.4658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01023    0.05642  -0.181    0.856    
## vs           0.78055    0.06132  12.730  < 2e-16 ***
## vn           0.61687    0.07202   8.565 1.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8844 on 243 degrees of freedom
## Multiple R-squared:  0.6195, Adjusted R-squared:  0.6163 
## F-statistic: 197.8 on 2 and 243 DF,  p-value: < 2.2e-16

Comparing R^2 between the train and test sets is what reveals the overfitting.

3.3 Cross-validation
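
The variable names x_s_catN and x_n_catN and the call to mk_formula in the summary below suggest that vtreat's cross-frame machinery and the wrapr package were used, as in Zumel (2019). A sketch under that assumption (the exact options used in the note are unknown):

    library(vtreat)
    library(wrapr)

    # Cross-frame: each row's impact code is computed from folds that do not
    # contain that row, so the encoding step never sees the row it is applied to.
    cross_frame_exp <- mkCrossFrameNExperiment(
      dframe = dtrain,
      varlist = c("x_s", "x_n"),
      outcomename = "y"
    )
    dtrain_treated <- cross_frame_exp$crossFrame

    variables <- c("x_s_catN", "x_n_catN")   # keep only the impact-coded columns
    summary(lm(mk_formula("y", variables), data = dtrain_treated))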

## 
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.85533 -0.70304 -0.04544  0.69367  2.98327 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.013627   0.066773  -0.204    0.838    
## x_s_catN     0.963333   0.065729  14.656   <2e-16 ***
## x_n_catN     0.006362   0.062030   0.103    0.918    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.044 on 243 degrees of freedom
## Multiple R-squared:  0.4695, Adjusted R-squared:  0.4651 
## F-statistic: 107.5 on 2 and 243 DF,  p-value: < 2.2e-16

\(R_{\text{train}}^2\) drops and no longer overstates \(R_{\text{test}}^2\), so the overfitting is reduced.

Ideally \(R_{\text{test}}^2\) would also go up.

A higher test-set R^2 together with a small train/test gap is what shows that cross-validation is the better approach.

3.4 Regularization

For a one-variable model, L2-regularization is simply Laplace smoothing.

L2 regularization amounts to Laplace smoothing.

\[v_i = \sum\nolimits_{x_j = s_i} y_i/(\text{count}_i + \lambda) - E[y_i]\]
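
A sketch of this Laplace-smoothed encoding; the helper name laplace_code and the value of lambda are assumptions (the lambda actually used in the note is not shown):

    # Laplace-smoothed impact code: divide the level sum of y by (count + lambda),
    # then subtract the grand mean; for a one-variable model this is L2 regularization.
    laplace_code <- function(x, y, lambda) {
      v <- tapply(y, x, sum) / (tapply(y, x, length) + lambda) - mean(y)
      as.numeric(v[as.character(x)])
    }

    lambda <- 5   # assumed smoothing strength
    dtrain_treated <- data.frame(
      y  = dtrain$y,
      vs = laplace_code(dtrain$x_s, dtrain$y, lambda),
      vn = laplace_code(dtrain$x_n, dtrain$y, lambda)
    )
    summary(lm(y ~ vs + vn, data = dtrain_treated))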

## 
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0167 -0.5639  0.0159  0.5472  2.4304 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02210    0.05658  -0.391    0.696    
## vs           0.81520    0.06389  12.760  < 2e-16 ***
## vn           0.80144    0.09495   8.441 2.86e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8873 on 243 degrees of freedom
## Multiple R-squared:  0.6169, Adjusted R-squared:  0.6138 
## F-statistic: 195.7 on 2 and 243 DF,  p-value: < 2.2e-16

We can see that L2 regularization achieves very little here.

3.5 Regularization 2.0
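
Presumably this is the same encoding with a much heavier smoothing weight; a sketch reusing the hypothetical laplace_code helper from the previous subsection (the actual lambda is not shown, 100 is just a guess):

    lambda <- 100   # assumed: much stronger smoothing than in 3.4
    dtrain_treated <- data.frame(
      y  = dtrain$y,
      vs = laplace_code(dtrain$x_s, dtrain$y, lambda),
      vn = laplace_code(dtrain$x_n, dtrain$y, lambda)
    )
    summary(lm(y ~ vs + vn, data = dtrain_treated))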

## 
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01570 -0.57022 -0.00065  0.57939  2.49627 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.40213    0.07306  -5.504 9.40e-08 ***
## vs           3.98571    0.30395  13.113  < 2e-16 ***
## vn          14.84943    1.95003   7.615 5.84e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.909 on 243 degrees of freedom
## Multiple R-squared:  0.598,  Adjusted R-squared:  0.5947 
## F-statistic: 180.7 on 2 and 243 DF,  p-value: < 2.2e-16

With the larger L2 penalty the result improves somewhat, but it is still not as good as cross-validation.

3.6 Summary

Using target encoding incorrectly overfits; since regularization turned out to help very little, just use cross-validation.

This can be understood in terms of stacking. The rule in stacking that the same training sample should not be reused refers to not reusing it across the two levels; within the same level it can be reused.

Therefore, in principle the samples used to fit the target encoding and the samples on which the encoded variables feed the final model should not be the same. After the cross-validation treatment, the samples used to fit the encoding and the samples used to train the final model no longer overlap so heavily.

This leads naturally to stacking itself (see the sketch after the lists below).

  1. Split train data into two parts
  2. Train multiple models on Part 1
  3. Make predictions on Part 2
  4. Make predictions on the test data
  5. Train a new model on Part 2 using predictions as features
  6. Make predictions on the test data using the 2nd level model
In other words:

  1. data => train + hold-out
  2. train => p1 + p2
  3. Train the 1st-level models on p1 and predict on both p2 and the hold-out set; the p2 predictions are used as training features, and the hold-out predictions as inputs for the 2nd-level model at prediction time
  4. Train the 2nd-level model (the meta-model) on p2 and predict on the hold-out set
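
A minimal sketch of this hold-out style stacking, using two plain lm base learners on the built-in mtcars data instead of gradient boosting and random forest, purely for illustration; all names are assumptions.

    set.seed(42)
    idx <- sample(c("p1", "p2", "holdout"), nrow(mtcars), replace = TRUE)
    p1      <- mtcars[idx == "p1", ]
    p2      <- mtcars[idx == "p2", ]
    holdout <- mtcars[idx == "holdout", ]

    # 1st level: base learners trained on p1 only.
    m1 <- lm(mpg ~ wt, data = p1)
    m2 <- lm(mpg ~ hp + disp, data = p1)

    # Their predictions become features on p2 (to train the meta-model)
    # and on the hold-out set (for the final prediction).
    p2_feats      <- data.frame(mpg = p2$mpg,
                                f1 = predict(m1, newdata = p2),
                                f2 = predict(m2, newdata = p2))
    holdout_feats <- data.frame(f1 = predict(m1, newdata = holdout),
                                f2 = predict(m2, newdata = holdout))

    # 2nd level (meta-model): a linear model without an intercept,
    # as in the quote below, fit on p2 and evaluated on the hold-out set.
    meta <- lm(mpg ~ f1 + f2 - 1, data = p2_feats)
    pred_holdout <- predict(meta, newdata = holdout_feats)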

Usually, the 2nd level model is some simple model like Linear or Logistic Regressions. Also, note that you were not using intercept in the Linear Regression just to combine pure model predictions. Looking at the coefficients, it’s clear that 2nd level model has more trust in the Gradient Boosting: 0.7 versus 0.3 for the Random Forest model. (Babakhin 2019)

This is a practical trick from industry: do not use an intercept when combining the base-model predictions.

3.7 Helper functions

## ../R
## +-- create_level.R
## +-- evaluate_model.R
## +-- get_level_mean.R
## +-- load.R
## \-- path_md_fence.R

$../R/create_level.R

$../R/evaluate_model.R

$../R/get_level_mean.R

$../R/load.R

$../R/path_md_fence.R

4 Python implementation example

The output below appears to come from a 5-fold out-of-fold pipeline on a Kaggle basketball dataset (Babakhin 2019): for each fold, statistics are computed on the CV-train part (about 140,000 rows, 35 columns) and attached to the held-out part (about 35,000 rows, which grows to 73 columns); the five encoded pieces are then concatenated back into the full 175,008-row table.

(175008, 35)
Fold: 0
CV train shape: (140006, 35)
Fold: 1
CV train shape: (140006, 35)
Fold: 2
CV train shape: (140006, 35)
Fold: 3
CV train shape: (140007, 35)
Fold: 4
CV train shape: (140007, 35)
5
pandas.core.frame.DataFrame
#  0
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
#  1
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
#  2
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
#  3
<class 'pandas.core.frame.DataFrame'>
(35001, 73)
#  4
<class 'pandas.core.frame.DataFrame'>
(35001, 73)
Season DayNum T1_TeamID T1_Score T2_TeamID T2_Score location NumOT T1_FGM T1_FGA T2_opponent_FGMmean T2_opponent_FGAmean T2_opponent_FGM3mean T2_opponent_FGA3mean T2_opponent_ORmean T2_opponent_Astmean T2_opponent_TOmean T2_opponent_Stlmean T2_opponent_Blkmean T2_PointDiffmean
0 2003 13 1166 106 1426 50 1 0 41 69 23.708333 54.708333 5.708333 17.458333 10.583333 11.416667 14.041667 7.625000 3.125000 -0.583333
1 2003 14 1353 60 1162 36 1 0 23 57 21.894737 50.842105 6.789474 18.315789 10.315789 12.947368 12.578947 9.263158 3.947368 -15.368421
2 2003 14 1390 61 1131 57 1 0 20 53 20.880000 51.680000 4.880000 16.600000 10.640000 11.920000 13.240000 6.840000 3.120000 5.840000
3 2003 14 1426 59 1106 47 0 0 25 53 22.043478 52.608696 4.869565 15.652174 11.130435 12.434783 14.782609 8.521739 3.043478 -1.130435
4 2003 18 1113 59 1287 56 1 0 22 54 28.157895 58.894737 7.052632 19.631579 12.000000 16.263158 16.052632 5.789474 3.473684 2.578947

5 rows × 73 columns

(175008, 73)

Appendix

References

Babakhin, Yauhen. 2019. “Winning a Kaggle Competition in Python.” DataCamp. 2019. https://www.datacamp.com/courses/winning-a-kaggle-competition-in-python.

Zumel, Nina. 2019. “When Cross-Validation Is More Powerful Than Regularization.” Win-Vector Blog. 2019. http://www.win-vector.com/blog/2019/11/when-cross-validation-is-more-powerful-than-regularization/.

潘华引. 2019. “一文详尽系列之CatBoost.” Datawhale. 2019. https://mp.weixin.qq.com/s/E3pSPsG18053F5GG1Z8jNQ.