1. Use the child parameter of RMarkdown to stitch documents together.
  2. Notes stitched this way are easier to review.
  3. Submit related questions as Issues.

1 Target statistics

Target statistics (潘华引 2019), impact coding (Zumel 2019), and target encoding (Babakhin 2019) are the same technique; it is also a modular (nested-model) way of handling categorical variables.

\[\text{Greedy TS}= \mathbb{E}\left(y | x^{i}=x_{k}^{i}\right)\]

\[\hat{x}_{k}^{i}=\frac{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}} \cdot y_{j}}{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}}}\]

That is, it is impact coding without subtracting the global mean.

To deal with overfitting (target leakage), a regularization term is added:

\[\hat{x}_{k}^{i}=\frac{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}} \cdot y_{j}+a p}{\sum_{j=1}^{n} \mathbb{I}_{\left\{x_{j}^{i}=x_{k}^{i}\right\}}+a}\]

Here \(a\) is the weight given to the prior \(p\); the prior is usually set to the mean of the target variable over the whole dataset (潘华引 2019). The encoding is thus a weighted average of the level mean and the prior.
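
A minimal R sketch of this smoothed statistic; the column names x and y and the weight a = 20 are illustrative assumptions, not taken from the cited sources.

    # Smoothed target statistic for a single categorical column.
    # a is the weight of the prior p; p is taken as the global mean of y.
    smoothed_ts <- function(x, y, a = 20) {
      p <- mean(y)                          # prior: global target mean
      level_sum   <- tapply(y, x, sum)      # sum of y within each level
      level_count <- tapply(y, x, length)   # number of rows in each level
      ts <- (level_sum + a * p) / (level_count + a)
      ts[as.character(x)]                   # map every row to its level's statistic
    }

    # Toy usage on simulated data.
    set.seed(1)
    x <- sample(letters[1:5], 100, replace = TRUE)
    y <- rnorm(100) + as.integer(factor(x))
    head(smoothed_ts(x, y))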

However, this smoothing-style regularization turns out to help very little (Zumel 2019), so cross-validation-based methods are needed; this is what motivates the Holdout TS variant.

2 Main theory

The intuition is that smaller coefficients are less sensitive to idiosyncracies in the training data, and hence, less likely to overfit. (Zumel 2019)

The larger the coefficients (beta), the more the fit is driven by idiosyncrasies of the training data.

Cross-validation is a way to safely reuse training data in nested model situations. (Zumel 2019)

The key points about cross-validation:

  1. it is safe (it avoids leaking the outcome)
  2. it lets you reuse the training data

In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias. (Zumel 2019)

Hyperparameter learning and model training should use separate data. This is also how pre-trained models operate: fitting the target encoding and then running a model on the pre-computed encoding are two separate stages.

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation? (Zumel 2019) The short answer: no, you shouldn’t. But as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.” (Zumel 2019)

Cross-validation is the more computationally expensive of the two.

One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models. (Zumel 2019)

This is the target encoding approach, also known as impact coding.

specifically for dealing with high-cardinality categorical variables (Zumel 2019)

This turns an unordered, high-cardinality categorical variable into a single ordered numeric variable.

Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one variable base learners and to fit the larger model, you have overfit. (Zumel 2019)

This shows that careless target encoding makes the encoded variable look highly significant even when the underlying variable is pure noise; that is overfitting, and it is misleading.

The correct way to impact code (or to nest models in general) is to use cross-validation techniques. (Zumel 2019)

This is where cross-validation is brought in to do the target encoding.

We can compare the performance of this model to the naive model on holdout data. (Zumel 2019)

With cross-validation, the encoded noise variable is no longer significant, and R^2 on the test set actually improves.

But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace smoothed value minus the grand mean. (Zumel 2019)

\[v_{i}=\Sigma_{x_{j}=s_{i}} y_{i} /\left(\operatorname{count}_{i}+\lambda\right)-E\left[y_{i}\right]\]

For a one-variable model, this regularization is exactly Laplace smoothing.

http://www.win-vector.com/blog/2019/11/when-cross-validation-is-more-powerful-than-regularization/

Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.(Zumel 2019)

This establishes the correct way to do target encoding and, more generally, stacking.

3 Implementation code

See GitHub for the full code.

We found that an incorrect target encoding procedure easily leads to overfitting. The usual ways to handle this overfitting are:

  1. cross-validation (this is the correct way to do target encoding)
  2. regularization

We will try both in order to arrive at the correct way of handling the overfitting.

3.1 Simulating the data

## 
## n_01 n_02 n_03 n_04 n_05 n_06 n_07 n_08 n_09 n_10 
##   22   27   34   18   26   23   28   22   24   22
## 
##  n_01  n_02  n_04  n_05  n_06  n_07  n_08  n_09  n_10 n_100  n_11  n_12  n_13 
##     3     3     2     1     2     3     2     3     1     1     3     2     1 
##  n_14  n_15  n_16  n_17  n_18  n_19  n_20  n_21  n_22  n_23  n_24  n_25  n_26 
##     4     2     4     1     2     2     5     1     3     3     6     4     7 
##  n_27  n_29  n_30  n_31  n_33  n_34  n_35  n_36  n_37  n_38  n_39  n_40  n_41 
##     3     2     4     3     3     3     4     5     3     2     3     2     3 
##  n_42  n_44  n_45  n_46  n_47  n_48  n_50  n_51  n_52  n_53  n_54  n_55  n_56 
##     6     4     1     3     1     2     1     1     3     4     3     3     5 
##  n_57  n_58  n_59  n_61  n_62  n_63  n_65  n_66  n_67  n_68  n_71  n_72  n_73 
##     2     1     3     4     3     3     4     6     2     4     3     1     4 
##  n_74  n_75  n_76  n_77  n_78  n_79  n_80  n_81  n_82  n_83  n_84  n_85  n_86 
##     3     3     4     3     2     2     3     3     2     1     2     1     2 
##  n_87  n_88  n_89  n_90  n_91  n_92  n_93  n_94  n_95  n_98  n_99 
##     2     4     2     1     3     1     2     3     3     2     4
##  [1] "我" "他" "我" "你" "他" "我" "他" "他" "他" "他"
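
The chunk that generated the output above is not echoed in the note. Below is a minimal sketch of a comparable simulation, loosely following the setup in Zumel (2019): a signal variable x_s with 10 levels that drive y, and a noise variable x_n with 100 levels that do not. All names and sizes are assumptions.

    set.seed(2019)
    n <- 246                        # roughly the size implied by the lm output (243 residual df + 3 coefficients)
    # Signal variable: 10 levels, each with its own contribution to y.
    levels_s <- sprintf("s_%02d", 1:10)
    effect_s <- setNames(rnorm(10), levels_s)
    x_s <- sample(levels_s, n, replace = TRUE)
    # Noise variable: 100 levels with no effect on y.
    levels_n <- sprintf("n_%02d", 1:100)
    x_n <- sample(levels_n, n, replace = TRUE)
    y <- as.numeric(effect_s[x_s] + rnorm(n))
    dtrain <- data.frame(x_s = x_s, x_n = x_n, y = y, stringsAsFactors = FALSE)
    table(dtrain$x_n)               # level counts, analogous to the tables printed above
    # A test set for the train/test comparisons below would be simulated the same way.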

3.2 Using the training set directly

The target encoding used here is y - mean(y): each level is encoded as its mean of y minus the grand mean.

\[v_i = E[y | x = s_i] - E[y]\]

\(s_i\) denotes the \(i\)-th level.

meany is the mean of y within each level, and coeff is that mean minus the global mean of y.
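
The chunk itself is not echoed; a sketch of this naive encoding, fit and applied on the same training data (vs, vn, and dtrain_treated match the names in the summary below; the impact_code helper and the dtrain from the simulation sketch above are assumptions):

    # Impact-code one categorical column: level mean of y minus the grand mean,
    # estimated and applied on the very same data (this is the mistake).
    impact_code <- function(x, y) {
      coeff <- tapply(y, x, mean) - mean(y)    # meany per level minus the global mean
      as.numeric(coeff[as.character(x)])
    }

    dtrain_treated <- data.frame(
      y  = dtrain$y,
      vs = impact_code(dtrain$x_s, dtrain$y),  # encoded signal variable
      vn = impact_code(dtrain$x_n, dtrain$y)   # encoded noise variable
    )
    summary(lm(y ~ vs + vn, data = dtrain_treated))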

## 
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9782 -0.5588  0.0022  0.5574  2.4658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01023    0.05642  -0.181    0.856    
## vs           0.78055    0.06132  12.730  < 2e-16 ***
## vn           0.61687    0.07202   8.565 1.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8844 on 243 degrees of freedom
## Multiple R-squared:  0.6195, Adjusted R-squared:  0.6163 
## F-statistic: 197.8 on 2 and 243 DF,  p-value: < 2.2e-16

Comparing R^2 between the train and test sets is what reveals the overfitting.

3.3 Cross-validation
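
The variable names x_s_catN and x_n_catN and the call to mk_formula in the summary below suggest that vtreat's cross-frame machinery and the wrapr package were used, as in Zumel (2019). A sketch under that assumption (the exact options used in the note are unknown):

    library(vtreat)
    library(wrapr)

    # Cross-frame: each row's impact code is computed from folds that do not
    # contain that row, so the encoding step never sees the row it is applied to.
    cross_frame_exp <- mkCrossFrameNExperiment(
      dframe = dtrain,
      varlist = c("x_s", "x_n"),
      outcomename = "y"
    )
    dtrain_treated <- cross_frame_exp$crossFrame

    variables <- c("x_s_catN", "x_n_catN")   # keep only the impact-coded columns
    summary(lm(mk_formula("y", variables), data = dtrain_treated))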

## 
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.85533 -0.70304 -0.04544  0.69367  2.98327 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.013627   0.066773  -0.204    0.838    
## x_s_catN     0.963333   0.065729  14.656   <2e-16 ***
## x_n_catN     0.006362   0.062030   0.103    0.918    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.044 on 243 degrees of freedom
## Multiple R-squared:  0.4695, Adjusted R-squared:  0.4651 
## F-statistic: 107.5 on 2 and 243 DF,  p-value: < 2.2e-16

\(R_{\text{train}}^2\) drops and no longer overstates \(R_{\text{test}}^2\), so the overfitting is reduced.

Ideally \(R_{\text{test}}^2\) would also go up.

A higher test-set R^2 together with a small train/test gap is what shows that cross-validation is the better approach.

3.4 Regularization

For a one-variable model, L2-regularization is simply Laplace smoothing.

L2 regularization amounts to Laplace smoothing.

\[v_i = \sum\nolimits_{x_j = s_i} y_i/(\text{count}_i + \lambda) - E[y_i]\]
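
A sketch of this Laplace-smoothed encoding; the helper name laplace_code and the value of lambda are assumptions (the lambda actually used in the note is not shown):

    # Laplace-smoothed impact code: divide the level sum of y by (count + lambda),
    # then subtract the grand mean; for a one-variable model this is L2 regularization.
    laplace_code <- function(x, y, lambda) {
      v <- tapply(y, x, sum) / (tapply(y, x, length) + lambda) - mean(y)
      as.numeric(v[as.character(x)])
    }

    lambda <- 5   # assumed smoothing strength
    dtrain_treated <- data.frame(
      y  = dtrain$y,
      vs = laplace_code(dtrain$x_s, dtrain$y, lambda),
      vn = laplace_code(dtrain$x_n, dtrain$y, lambda)
    )
    summary(lm(y ~ vs + vn, data = dtrain_treated))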

## 
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0167 -0.5639  0.0159  0.5472  2.4304 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02210    0.05658  -0.391    0.696    
## vs           0.81520    0.06389  12.760  < 2e-16 ***
## vn           0.80144    0.09495   8.441 2.86e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8873 on 243 degrees of freedom
## Multiple R-squared:  0.6169, Adjusted R-squared:  0.6138 
## F-statistic: 195.7 on 2 and 243 DF,  p-value: < 2.2e-16

We can see that L2 regularization achieves very little here.

3.5 Regularization 2.0
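
Presumably this is the same encoding with a much heavier smoothing weight; a sketch reusing the hypothetical laplace_code helper from the previous subsection (the actual lambda is not shown, 100 is just a guess):

    lambda <- 100   # assumed: much stronger smoothing than in 3.4
    dtrain_treated <- data.frame(
      y  = dtrain$y,
      vs = laplace_code(dtrain$x_s, dtrain$y, lambda),
      vn = laplace_code(dtrain$x_n, dtrain$y, lambda)
    )
    summary(lm(y ~ vs + vn, data = dtrain_treated))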

## 
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01570 -0.57022 -0.00065  0.57939  2.49627 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.40213    0.07306  -5.504 9.40e-08 ***
## vs           3.98571    0.30395  13.113  < 2e-16 ***
## vn          14.84943    1.95003   7.615 5.84e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.909 on 243 degrees of freedom
## Multiple R-squared:  0.598,  Adjusted R-squared:  0.5947 
## F-statistic: 180.7 on 2 and 243 DF,  p-value: < 2.2e-16

With the larger L2 penalty the result improves somewhat, but it is still not as good as cross-validation.

3.6 Summary

Using target encoding incorrectly overfits; since regularization turned out to help very little, just use cross-validation.

This can be understood in terms of stacking. The rule in stacking that the same training sample should not be reused refers to not reusing it across the two levels; within the same level it can be reused.

Therefore, in principle the samples used to fit the target encoding and the samples on which the encoded variables feed the final model should not be the same. After the cross-validation treatment, the samples used to fit the encoding and the samples used to train the final model no longer overlap so heavily.

This leads naturally to stacking itself (see the sketch after the lists below).

  1. Split train data into two parts
  2. Train multiple models on Part 1
  3. Make predictions on Part 2
  4. Make predictions on the test data
  5. Train a new model on Part 2 using predictions as features
  6. Make predictions on the test data using the 2nd level model
In other words:

  1. data => train + hold-out
  2. train => p1 + p2
  3. Train the 1st-level models on p1 and predict on both p2 and the hold-out set; the p2 predictions are used as training features, and the hold-out predictions as inputs for the 2nd-level model at prediction time
  4. Train the 2nd-level model (the meta-model) on p2 and predict on the hold-out set
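
A minimal sketch of this hold-out style stacking, using two plain lm base learners on the built-in mtcars data instead of gradient boosting and random forest, purely for illustration; all names are assumptions.

    set.seed(42)
    idx <- sample(c("p1", "p2", "holdout"), nrow(mtcars), replace = TRUE)
    p1      <- mtcars[idx == "p1", ]
    p2      <- mtcars[idx == "p2", ]
    holdout <- mtcars[idx == "holdout", ]

    # 1st level: base learners trained on p1 only.
    m1 <- lm(mpg ~ wt, data = p1)
    m2 <- lm(mpg ~ hp + disp, data = p1)

    # Their predictions become features on p2 (to train the meta-model)
    # and on the hold-out set (for the final prediction).
    p2_feats      <- data.frame(mpg = p2$mpg,
                                f1 = predict(m1, newdata = p2),
                                f2 = predict(m2, newdata = p2))
    holdout_feats <- data.frame(f1 = predict(m1, newdata = holdout),
                                f2 = predict(m2, newdata = holdout))

    # 2nd level (meta-model): a linear model without an intercept,
    # as in the quote below, fit on p2 and evaluated on the hold-out set.
    meta <- lm(mpg ~ f1 + f2 - 1, data = p2_feats)
    pred_holdout <- predict(meta, newdata = holdout_feats)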

Usually, the 2nd level model is some simple model like Linear or Logistic Regressions. Also, note that you were not using intercept in the Linear Regression just to combine pure model predictions. Looking at the coefficients, it’s clear that 2nd level model has more trust in the Gradient Boosting: 0.7 versus 0.3 for the Random Forest model. (Babakhin 2019)

This is a practical trick from industry: do not use an intercept when combining the base-model predictions.

3.7 Helper functions

## ../R
## +-- create_level.R
## +-- evaluate_model.R
## +-- get_level_mean.R
## +-- load.R
## \-- path_md_fence.R

$../R/create_level.R

$../R/evaluate_model.R

$../R/get_level_mean.R

$../R/load.R

$../R/path_md_fence.R

4 Python implementation example

The output below appears to come from a 5-fold out-of-fold pipeline on a Kaggle basketball dataset (Babakhin 2019): for each fold, statistics are computed on the CV-train part (about 140,000 rows, 35 columns) and attached to the held-out part (about 35,000 rows, which grows to 73 columns); the five encoded pieces are then concatenated back into the full 175,008-row table.

(175008, 35)
Fold: 0
CV train shape: (140006, 35)
Fold: 1
CV train shape: (140006, 35)
Fold: 2
CV train shape: (140006, 35)
Fold: 3
CV train shape: (140007, 35)
Fold: 4
CV train shape: (140007, 35)
5
pandas.core.frame.DataFrame
#  0
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
#  1
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
#  2
<class 'pandas.core.frame.DataFrame'>
(35002, 73)
#  3
<class 'pandas.core.frame.DataFrame'>
(35001, 73)
#  4
<class 'pandas.core.frame.DataFrame'>
(35001, 73)
Season DayNum T1_TeamID T1_Score T2_TeamID T2_Score location NumOT T1_FGM T1_FGA T2_opponent_FGMmean T2_opponent_FGAmean T2_opponent_FGM3mean T2_opponent_FGA3mean T2_opponent_ORmean T2_opponent_Astmean T2_opponent_TOmean T2_opponent_Stlmean T2_opponent_Blkmean T2_PointDiffmean
0 2003 13 1166 106 1426 50 1 0 41 69 23.708333 54.708333 5.708333 17.458333 10.583333 11.416667 14.041667 7.625000 3.125000 -0.583333
1 2003 14 1353 60 1162 36 1 0 23 57 21.894737 50.842105 6.789474 18.315789 10.315789 12.947368 12.578947 9.263158 3.947368 -15.368421
2 2003 14 1390 61 1131 57 1 0 20 53 20.880000 51.680000 4.880000 16.600000 10.640000 11.920000 13.240000 6.840000 3.120000 5.840000
3 2003 14 1426 59 1106 47 0 0 25 53 22.043478 52.608696 4.869565 15.652174 11.130435 12.434783 14.782609 8.521739 3.043478 -1.130435
4 2003 18 1113 59 1287 56 1 0 22 54 28.157895 58.894737 7.052632 19.631579 12.000000 16.263158 16.052632 5.789474 3.473684 2.578947

5 rows × 73 columns

(175008, 73)

Appendix

References

Babakhin, Yauhen. 2019. “Winning a Kaggle Competition in Python.” DataCamp. 2019. https://www.datacamp.com/courses/winning-a-kaggle-competition-in-python.

Zumel, Nina. 2019. “When Cross-Validation Is More Powerful Than Regularization.” Win-Vector Blog. 2019. http://www.win-vector.com/blog/2019/11/when-cross-validation-is-more-powerful-than-regularization/.

潘华引. 2019. “一文详尽系列之CatBoost.” Datawhale. 2019. https://mp.weixin.qq.com/s/E3pSPsG18053F5GG1Z8jNQ.