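SmartEDA[@SmartEDA] covers essentially every summary one would want from a single data table. It saves time, and it can be run once on every new table as a first look at the data. This post follows the vignette given in @SmartEDAvignettes.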

knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache = TRUE)
library(SmartEDA)
library(tidyverse)
library(ISLR)
library(formattable)
library(knitr)
library(DT)
Carseats <- ISLR::Carseats

Overview

Use the function ExpData.

ExpData(data = Carseats, type = 1) %>% kable()
Descriptions Obs
Sample size (Nrow) 400
No. of Variables (Ncol) 11
No. of Numeric Variables 8
No. of Factor Variables 3
No. of Text Variables 0
No. of Logical Variables 0
No. of Date Variables 0
No. of Zero variance Variables (Uniform) 0
%. of Variables having complete cases 100% (11)
%. of Variables having <50% missing cases 0% (0)
%. of Variables having >50% missing cases 0% (0)
%. of Variables having >90% missing cases 0% (0)
ExpData(data = Carseats, type = 2) %>% kable()
S.no Variable Name Variable Type % of Missing No. of Unique values
1 Sales numeric 0 336
2 CompPrice numeric 0 73
3 Income numeric 0 98
4 Advertising numeric 0 28
5 Population numeric 0 275
6 Price numeric 0 101
7 ShelveLoc* factor 0 3
8 Age numeric 0 56
9 Education numeric 0 9
10 Urban* factor 0 2
11 US* factor 0 2

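type = 1 and type = 2 select the two summary views shown above: the first is a dataset-level overview, the second a per-variable listing.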
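To make the columns of the per-variable view concrete, here is a minimal base-R sketch that computes similar quantities by hand. The helper name `describe_vars` and the toy data frame are hypothetical, and whether SmartEDA counts NA as a unique value is not confirmed here; this sketch excludes it.

```r
# Hypothetical helper: a rough base-R analogue of a per-variable overview.
# It is not SmartEDA's implementation.
describe_vars <- function(df) {
  data.frame(
    variable    = names(df),
    # first class of each column, e.g. "numeric" or "factor"
    type        = vapply(df, function(col) class(col)[1], character(1)),
    # share of missing values per column, between 0 and 1
    pct_missing = vapply(df, function(col) mean(is.na(col)), numeric(1)),
    # distinct non-missing values (assumption: NA is not counted)
    n_unique    = vapply(df, function(col) length(unique(col[!is.na(col)])), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data, made up for illustration
toy <- data.frame(
  price = c(10.5, 12.0, NA, 12.0),
  shelf = factor(c("Bad", "Good", "Good", "Medium"))
)
overview <- describe_vars(toy)
print(overview)
```

Running the same helper on Carseats should give one row per column, which can be compared with the type = 2 table.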

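Exploratory analysis

The analysis is split into categorical and continuous variables. For each kind, results are shown without a target variable, with a continuous target, and with a categorical target, first as tables and then as plots.

Continuous variables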

table

ExpNumStat(
    Carseats,
    by = "A",
    gp = NULL,
    Qnt = seq(0, 1, 0.1),
    MesofShape = 2,
    Outlier = TRUE,
    round = 2
) %>%
    mutate_at(vars(Per_of_Missing), percent) %>% kable()
Vname Group TN nNeg nZero nPos NegInf PosInf NA_Value Per_of_Missing sum min max mean median SD CV IQR Skewness Kurtosis 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% LB.25% UB.75% nOutliers
Advertising All 400 0 144 256 0 0 0 0.00% 2654.00 0 29.00 6.63 5.00 6.65 1.00 12.00 0.64 -0.55 0 0.00 0.00 0.00 2.00 5.00 8.40 11.00 13.00 16.0 29.00 -18.00 30.00 0
Age All 400 0 0 400 0 0 0 0.00% 21329.00 25 80.00 53.32 54.50 16.20 0.30 26.25 -0.08 -1.14 25 30.00 36.00 43.00 48.60 54.50 60.00 64.00 70.00 76.0 80.00 0.38 105.38 0
CompPrice All 400 0 0 400 0 0 0 0.00% 49990.00 77 175.00 124.97 125.00 15.33 0.12 20.00 -0.04 0.03 77 106.00 112.80 117.00 121.00 125.00 130.00 133.00 137.20 145.0 175.00 85.00 165.00 2
Income All 400 0 0 400 0 0 0 0.00% 27463.00 21 120.00 68.66 69.00 27.99 0.41 48.25 0.05 -1.09 21 30.00 39.00 48.00 62.00 69.00 77.00 84.60 96.20 107.0 120.00 -29.62 163.38 0
Population All 400 0 0 400 0 0 0 0.00% 105936.00 10 509.00 264.84 272.00 147.38 0.56 259.50 -0.05 -1.20 10 58.90 110.40 160.00 218.60 272.00 317.80 366.00 412.20 467.0 509.00 -250.25 787.75 0
Price All 400 0 0 400 0 0 0 0.00% 46318.00 24 191.00 115.80 117.00 23.68 0.20 31.00 -0.12 0.43 24 87.00 96.80 104.00 110.00 117.00 122.00 128.30 134.00 146.0 191.00 53.50 177.50 5
Sales All 400 0 1 399 0 0 0 0.00% 2998.53 0 16.27 7.50 7.49 2.82 0.38 3.93 0.18 -0.10 0 4.12 5.07 5.87 6.61 7.49 8.08 8.82 9.76 11.3 16.27 -0.50 15.21 2
ExpNumStat(
    Carseats,
    by = "A",
    gp = "Price",
    Qnt = seq(0, 1, 0.1),
    MesofShape = 2,
    Outlier = TRUE,
    round = 2
) %>%
    mutate_at(vars(Per_of_Missing), percent) %>% kable()
Vname Group Note TN nNeg nZero nPos NegInf PosInf NA_Value Per_of_Missing sum min max mean median SD CV IQR Skewness Kurtosis 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% LB.25% UB.75% nOutliers cor
Advertising Price Cor b/w Price 400 0 144 256 0 0 0 0.00% 2654.00 0 29.00 6.63 5.00 6.65 1.00 12.00 0.64 -0.55 0 0.00 0.00 0.00 2.00 5.00 8.40 11.00 13.00 16.0 29.00 -18.00 30.00 0 0.04
Age Price Cor b/w Price 400 0 0 400 0 0 0 0.00% 21329.00 25 80.00 53.32 54.50 16.20 0.30 26.25 -0.08 -1.14 25 30.00 36.00 43.00 48.60 54.50 60.00 64.00 70.00 76.0 80.00 0.38 105.38 0 -0.10
CompPrice Price Cor b/w Price 400 0 0 400 0 0 0 0.00% 49990.00 77 175.00 124.97 125.00 15.33 0.12 20.00 -0.04 0.03 77 106.00 112.80 117.00 121.00 125.00 130.00 133.00 137.20 145.0 175.00 85.00 165.00 2 0.58
Income Price Cor b/w Price 400 0 0 400 0 0 0 0.00% 27463.00 21 120.00 68.66 69.00 27.99 0.41 48.25 0.05 -1.09 21 30.00 39.00 48.00 62.00 69.00 77.00 84.60 96.20 107.0 120.00 -29.62 163.38 0 -0.06
Population Price Cor b/w Price 400 0 0 400 0 0 0 0.00% 105936.00 10 509.00 264.84 272.00 147.38 0.56 259.50 -0.05 -1.20 10 58.90 110.40 160.00 218.60 272.00 317.80 366.00 412.20 467.0 509.00 -250.25 787.75 0 -0.01
Price Price Cor b/w Price 400 0 0 400 0 0 0 0.00% 46318.00 24 191.00 115.80 117.00 23.68 0.20 31.00 -0.12 0.43 24 87.00 96.80 104.00 110.00 117.00 122.00 128.30 134.00 146.0 191.00 53.50 177.50 5 1.00
Sales Price Cor b/w Price 400 0 1 399 0 0 0 0.00% 2998.53 0 16.27 7.50 7.49 2.82 0.38 3.93 0.18 -0.10 0 4.12 5.07 5.87 6.61 7.49 8.08 8.82 9.76 11.3 16.27 -0.50 15.21 2 -0.44
ExpNumStat(
    Carseats,
    by = "GA",
    gp = "Urban",
    Qnt = seq(0, 1, 0.1),
    MesofShape = 2,
    Outlier = TRUE,
    round = 2
) %>%
    mutate_at(vars(Per_of_Missing), percent) %>% kable()
Vname Group TN nNeg nZero nPos NegInf PosInf NA_Value Per_of_Missing sum min max mean median SD CV IQR Skewness Kurtosis 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% LB.25% UB.75% nOutliers
Advertising Urban:All 400 0 144 256 0 0 0 0.00% 2654.00 0.00 29.00 6.63 5.00 6.65 1.00 12.00 0.64 -0.55 0.00 0.00 0.00 0.00 2.00 5.00 8.40 11.00 13.00 16.00 29.00 -18.00 30.00 0
Advertising Urban:Yes 282 0 94 188 0 0 0 0.00% 1922.00 0.00 29.00 6.82 5.00 6.73 0.99 12.00 0.66 -0.49 0.00 0.00 0.00 0.00 3.00 5.00 8.00 11.00 13.00 16.00 29.00 -18.00 30.00 0
Advertising Urban:No 118 0 50 68 0 0 0 0.00% 732.00 0.00 24.00 6.20 5.00 6.47 1.04 11.00 0.56 -0.80 0.00 0.00 0.00 0.00 0.00 5.00 9.00 10.90 12.00 14.00 24.00 -16.50 27.50 0
Age Urban:All 400 0 0 400 0 0 0 0.00% 21329.00 25.00 80.00 53.32 54.50 16.20 0.30 26.25 -0.08 -1.14 25.00 30.00 36.00 43.00 48.60 54.50 60.00 64.00 70.00 76.00 80.00 0.38 105.38 0
Age Urban:Yes 282 0 0 282 0 0 0 0.00% 15121.00 25.00 80.00 53.62 55.00 15.98 0.30 24.75 -0.10 -1.10 25.00 32.00 37.00 43.00 50.00 55.00 61.00 64.00 69.00 75.00 80.00 3.88 102.88 0
Age Urban:No 118 0 0 118 0 0 0 0.00% 6208.00 25.00 80.00 52.61 53.50 16.76 0.32 28.00 -0.01 -1.20 25.00 29.00 36.00 41.00 47.00 53.50 57.00 63.90 71.60 76.00 80.00 -3.00 109.00 0
CompPrice Urban:All 400 0 0 400 0 0 0 0.00% 49990.00 77.00 175.00 124.97 125.00 15.33 0.12 20.00 -0.04 0.03 77.00 106.00 112.80 117.00 121.00 125.00 130.00 133.00 137.20 145.00 175.00 85.00 165.00 2
CompPrice Urban:Yes 282 0 0 282 0 0 0 0.00% 35429.00 77.00 175.00 125.63 125.00 15.70 0.12 20.00 -0.04 0.14 77.00 106.00 113.00 117.00 122.00 125.00 130.00 133.70 138.00 145.90 175.00 86.00 166.00 3
CompPrice Urban:No 118 0 0 118 0 0 0 0.00% 14561.00 88.00 154.00 123.40 123.00 14.36 0.12 20.25 -0.10 -0.47 88.00 106.00 111.00 115.10 120.80 123.00 128.00 132.00 135.00 142.30 154.00 82.88 163.88 0
Income Urban:All 400 0 0 400 0 0 0 0.00% 27463.00 21.00 120.00 68.66 69.00 27.99 0.41 48.25 0.05 -1.09 21.00 30.00 39.00 48.00 62.00 69.00 77.00 84.60 96.20 107.00 120.00 -29.62 163.38 0
Income Urban:Yes 282 0 0 282 0 0 0 0.00% 19555.00 21.00 120.00 69.34 70.00 27.27 0.39 47.00 0.01 -1.07 21.00 32.00 41.00 52.00 63.00 70.00 79.00 87.00 96.00 105.00 120.00 -26.50 161.50 0
Income Urban:No 118 0 0 118 0 0 0 0.00% 7908.00 21.00 120.00 67.02 67.00 29.68 0.44 49.75 0.14 -1.12 21.00 26.70 36.00 46.00 59.80 67.00 73.40 82.00 99.80 110.30 120.00 -34.38 164.62 0
Population Urban:All 400 0 0 400 0 0 0 0.00% 105936.00 10.00 509.00 264.84 272.00 147.38 0.56 259.50 -0.05 -1.20 10.00 58.90 110.40 160.00 218.60 272.00 317.80 366.00 412.20 467.00 509.00 -250.25 787.75 0
Population Urban:Yes 282 0 0 282 0 0 0 0.00% 73288.00 12.00 508.00 259.89 274.00 144.61 0.56 237.75 -0.04 -1.17 12.00 60.00 105.00 156.60 217.20 274.00 315.60 353.70 405.60 453.50 508.00 -217.62 733.38 0
Population Urban:No 118 0 0 118 0 0 0 0.00% 32648.00 10.00 509.00 276.68 271.00 153.76 0.56 264.00 -0.09 -1.27 10.00 56.10 122.40 176.50 219.60 271.00 344.00 400.00 440.20 480.30 509.00 -252.00 804.00 0
Price Urban:All 400 0 0 400 0 0 0 0.00% 46318.00 24.00 191.00 115.80 117.00 23.68 0.20 31.00 -0.12 0.43 24.00 87.00 96.80 104.00 110.00 117.00 122.00 128.30 134.00 146.00 191.00 53.50 177.50 5
Price Urban:Yes 282 0 0 282 0 0 0 0.00% 32857.00 24.00 191.00 116.51 118.00 23.94 0.21 30.75 -0.23 0.55 24.00 87.00 97.00 104.30 110.00 118.00 124.00 129.00 134.00 146.90 191.00 54.88 177.88 4
Price Urban:No 118 0 0 118 0 0 0 0.00% 13461.00 49.00 185.00 114.08 112.50 23.04 0.20 31.00 0.13 0.22 49.00 88.70 96.00 101.20 106.80 112.50 120.00 125.90 131.60 144.00 185.00 50.50 174.50 2
Sales Urban:All 400 0 1 399 0 0 0 0.00% 2998.53 0.00 16.27 7.50 7.49 2.82 0.38 3.93 0.18 -0.10 0.00 4.12 5.07 5.87 6.61 7.49 8.08 8.82 9.76 11.30 16.27 -0.50 15.21 2
Sales Urban:Yes 282 0 0 282 0 0 0 0.00% 2106.03 0.37 16.27 7.47 7.42 2.84 0.38 3.92 0.27 -0.10 0.37 4.10 5.04 5.87 6.57 7.42 7.91 8.77 9.71 11.48 16.27 -0.50 15.17 2
Sales Urban:No 118 0 1 117 0 0 0 0.00% 892.50 0.00 14.90 7.56 7.67 2.81 0.37 3.90 -0.03 -0.06 0.00 4.30 5.14 5.84 6.87 7.67 8.49 8.98 9.77 11.11 14.90 -0.40 15.18 0
  • Vname – variable name
  • Group – target (grouping) variable
  • TN – total sample size (including NA observations)
  • nNeg – number of negative observations
  • nZero – number of zero observations
  • nPos – number of positive observations
  • NegInf – count of negative-infinite values
  • PosInf – count of positive-infinite values
  • NA_Value – count of missing (NA, "not available") values
  • Per_of_Missing – percentage of missing values
  • Min – minimum value
  • Max – maximum value
  • Mean – average value
  • Median – median value
  • SD – standard deviation
  • CV – coefficient of variation, SD/mean; the tables above report the plain ratio (for Advertising, 6.65/6.63 ≈ 1.00), and it is not a z-score
  • IQR – interquartile range, \(IQR = Q_3 - Q_1\)
  • Qnt – the requested quantiles
  • MesofShape – skewness and kurtosis
  • Outlier – number of outliers
  • Cor – correlation between the target and each independent variable
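The spread and outlier columns can be reproduced by hand. The sketch below assumes the usual 1.5 × IQR fences, which is consistent with the LB.25% and UB.75% values in the tables above (for Price, Q1 = 100 and Q3 = 131 give 53.5 and 177.5). The helper name `spread_stats` and the sample vector are made up for illustration.

```r
# Hypothetical helper: CV, IQR and IQR-based outlier fences for one numeric vector.
# Assumes fences at Q1 - k * IQR and Q3 + k * IQR with k = 1.5.
spread_stats <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3 (R's default type 7)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(
    cv         = sd(x) / mean(x),                 # plain ratio, not a percentage
    iqr        = iqr,
    lower      = lower,
    upper      = upper,
    n_outliers = sum(x < lower | x > upper)
  )
}

x <- c(2, 4, 4, 5, 7, 9, 30)  # toy data with one obvious outlier
s <- spread_stats(x)
str(s)
```

Here Q1 = 4 and Q3 = 8, so the fences are -2 and 14 and only the value 30 falls outside them.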

plot

ExpNumViz(Carseats,
          nlim = 10,
          Page = c(2, 4),
          sample = NULL)
## $`0`

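  • nlim: numeric variables with fewer than 10 unique values are not plotted
  • sample: number of variables to pick at random for plotting; NULL plots all of them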
ExpNumViz(
    Carseats,
    target = "Price",
    Page = c(2, 4)
)
## $`0`

ExpNumViz(
    Carseats,
    target = "Urban",
    col = c("pink", "yellow", "orange"),
    Page = c(2, 4)
)
## $`0`

Categorical variables

table

ExpCTable(Carseats) %>% kable()
Variable Valid Frequency Percent CumPercent
ShelveLoc Bad 96 24.00 24.00
ShelveLoc Good 85 21.25 45.25
ShelveLoc Medium 219 54.75 100.00
ShelveLoc TOTAL 400 NA NA
Urban No 118 29.50 29.50
Urban Yes 282 70.50 100.00
Urban TOTAL 400 NA NA
US No 142 35.50 35.50
US Yes 258 64.50 100.00
US TOTAL 400 NA NA
Education 10 48 12.00 12.00
Education 11 48 12.00 24.00
Education 12 49 12.25 36.25
Education 13 43 10.75 47.00
Education 14 40 10.00 57.00
Education 15 36 9.00 66.00
Education 16 47 11.75 77.75
Education 17 49 12.25 90.00
Education 18 40 10.00 100.00
Education TOTAL 400 NA NA
ExpCTable(
    Carseats,
    Target = "Price"
) %>% kable()
VARIABLE CATEGORY Price:(23.8,79.7] Price:(79.7,135] Price:(135,191] TOTAL
ShelveLoc Bad 6 75 15 96
ShelveLoc Good 7 61 17 85
ShelveLoc Medium 11 170 38 219
ShelveLoc TOTAL 24 306 70 400
Urban No 7 90 21 118
Urban Yes 17 216 49 282
Urban TOTAL 24 306 70 400
US No 10 105 27 142
US Yes 14 201 43 258
US TOTAL 24 306 70 400
Education 10 2 37 9 48
Education 11 1 41 6 48
Education 12 3 39 7 49
Education 13 1 31 11 43
Education 14 1 34 5 40
Education 15 6 23 7 36
Education 16 3 39 5 47
Education 17 3 35 11 49
Education 18 4 27 9 40
Education TOTAL 24 306 70 400
ExpCTable(
    Carseats,
    Target = "Urban"
) %>% kable()
VARIABLE CATEGORY Urban:No Urban:Yes TOTAL
ShelveLoc Bad 22 74 96
ShelveLoc Good 28 57 85
ShelveLoc Medium 68 151 219
ShelveLoc TOTAL 118 282 400
US No 46 96 142
US Yes 72 186 258
US TOTAL 118 282 400
Education 10 14 34 48
Education 11 15 33 48
Education 12 9 40 49
Education 13 12 31 43
Education 14 15 25 40
Education 15 10 26 36
Education 16 15 32 47
Education 17 17 32 49
Education 18 11 29 40
Education TOTAL 118 282 400
ExpCTable(
    mtcars,
    Target = "gear",
    margin = 1,
    clim = 10,
    nlim = 3,
    bin = NULL,
    per = FALSE
)
##    VARIABLE CATEGORY gear:3 gear:4 gear:5 TOTAL
## 1       cyl        4      1      8      2    11
## 2       cyl        6      2      4      1     7
## 3       cyl        8     12      0      2    14
## 4       cyl    TOTAL     15     12      5    32
## 5        vs        0     12      2      4    18
## 6        vs        1      3     10      1    14
## 7        vs    TOTAL     15     12      5    32
## 8        am        0     15      4      0    19
## 9        am        1      0      8      5    13
## 10       am    TOTAL     15     12      5    32
## 11     gear        3     15      0      0    15
## 12     gear        4      0     12      0    12
## 13     gear        5      0      0      5     5
## 14     gear    TOTAL     15     12      5    32

plot

ExpCatViz(
    Carseats,
    target = NULL,
    fname = NULL,
    clim = 10,
    margin = 2,
    Page = c(2, 1),
    sample = NULL
)
## $`0`

ExpCatViz(
    Carseats,
    target = "Urban",
    fname = NULL,
    clim = 10,
    col = NULL,
    margin = 2,
    Page = c(2, 1),
    sample = 2
)
## $`0`

WOE and IV analysis

Function for computing WOE

Carseats %>% distinct(Urban)
##   Urban
## 1   Yes
## 2    No
Carseats %>% names
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
ExpCatStat(
    Carseats,
    Target = "Urban",
    result = "IV",
    clim = 10,
    nlim = 5,
    Pclass = "Yes",
    plot = TRUE
) %>% kable()
Variable Class Out_1 Out_0 TOTAL Per_1 Per_0 Odds WOE IV Ref_1 Ref_0
ShelveLoc.1 Bad 74 22 96 0.26 0.19 0.85 0.31 0.02 Yes No
ShelveLoc.2 Good 57 28 85 0.20 0.24 0.81 -0.19 0.01 Yes No
ShelveLoc.3 Medium 151 68 219 0.54 0.58 1.55 -0.07 0.00 Yes No
US.1 No 96 46 142 0.34 0.39 1.24 -0.14 0.01 Yes No
US.2 Yes 186 72 258 0.66 0.61 0.81 0.08 0.00 Yes No
Sales.1 [0,4.11] 29 11 40 0.10 0.09 1.29 0.10 0.00 Yes No
Sales.2 (4.11,5.05] 29 11 40 0.10 0.09 0.75 0.10 0.00 Yes No
Sales.3 (5.05,5.86] 26 14 40 0.09 0.12 0.75 -0.29 0.01 Yes No
Sales.4 (5.86,6.59] 30 10 40 0.11 0.08 0.59 0.32 0.01 Yes No
Sales.5 (6.59,7.49] 32 9 41 0.11 0.08 1.44 0.32 0.01 Yes No
Sales.6 (7.49,8.07] 30 9 39 0.11 0.08 1.55 0.32 0.01 Yes No
Sales.7 (8.07,8.8] 24 16 40 0.09 0.14 1.29 -0.45 0.02 Yes No
Sales.8 (8.8,9.71] 26 14 40 0.09 0.12 0.75 -0.29 0.01 Yes No
Sales.9 (9.71,11.3] 26 14 40 0.09 0.12 1.11 -0.29 0.01 Yes No
Sales.10 (11.3,16.3] 30 10 40 0.11 0.08 1.11 0.32 0.01 Yes No
CompPrice.1 [77,106] 30 13 43 0.11 0.11 1.39 0.00 0.00 Yes No
CompPrice.2 (106,112] 22 15 37 0.08 0.13 1.05 -0.48 0.02 Yes No
CompPrice.3 (112,117] 34 11 45 0.12 0.09 1.23 0.29 0.01 Yes No
CompPrice.4 (117,121] 24 13 37 0.09 0.11 0.95 -0.20 0.00 Yes No
CompPrice.5 (121,125] 33 15 48 0.12 0.13 1.24 -0.08 0.00 Yes No
CompPrice.6 (125,130] 29 10 39 0.10 0.08 0.91 0.22 0.00 Yes No
CompPrice.7 (130,133] 25 11 36 0.09 0.09 0.75 0.00 0.00 Yes No
CompPrice.8 (133,137] 26 9 35 0.09 0.08 1.33 0.11 0.00 Yes No
CompPrice.9 (137,145] 30 12 42 0.11 0.10 0.58 0.10 0.00 Yes No
CompPrice.10 (145,175] 29 9 38 0.10 0.08 0.96 0.22 0.00 Yes No
Income.1 [21,30] 24 17 41 0.09 0.14 0.64 -0.45 0.02 Yes No
Income.2 (30,39] 30 11 41 0.11 0.09 1.33 0.20 0.00 Yes No
Income.3 (39,48] 29 10 39 0.10 0.08 1.76 0.22 0.00 Yes No
Income.4 (48,62] 28 15 43 0.10 0.13 1.44 -0.26 0.01 Yes No
Income.5 (62,69] 29 13 42 0.10 0.11 1.05 -0.09 0.00 Yes No
Income.6 (69,77] 25 10 35 0.09 0.08 0.93 0.11 0.00 Yes No
Income.7 (77,84] 30 9 39 0.11 0.08 0.76 0.32 0.01 Yes No
Income.8 (84,96] 32 8 40 0.11 0.07 1.24 0.45 0.02 Yes No
Income.9 (96,107] 31 10 41 0.11 0.08 1.16 0.32 0.01 Yes No
Income.10 (107,120] 24 15 39 0.09 0.13 0.55 -0.37 0.01 Yes No
Advertising.1 [0,2] 108 53 161 0.38 0.45 1.13 -0.17 0.01 Yes No
Advertising.2 (2,5] 37 8 45 0.13 0.07 1.40 0.62 0.04 Yes No
Advertising.3 (5,8] 25 9 34 0.09 0.08 0.95 0.11 0.00 Yes No
Advertising.4 (8,11] 37 20 57 0.13 0.17 0.74 -0.27 0.01 Yes No
Advertising.5 (11,13] 25 11 36 0.09 0.09 1.18 0.00 0.00 Yes No
Advertising.6 (13,16] 26 8 34 0.09 0.07 2.08 0.25 0.00 Yes No
Advertising.7 (16,29] 24 9 33 0.09 0.08 0.76 0.11 0.00 Yes No
Population.1 [10,58] 26 14 40 0.09 0.12 0.57 -0.29 0.01 Yes No
Population.2 (58,108] 33 7 40 0.12 0.06 0.89 0.69 0.04 Yes No
Population.3 (108,160] 29 12 41 0.10 0.10 0.82 0.00 0.00 Yes No
Population.4 (160,218] 25 14 39 0.09 0.12 1.82 -0.29 0.01 Yes No
Population.5 (218,272] 28 13 41 0.10 0.11 1.70 -0.09 0.00 Yes No
Population.6 (272,317] 31 8 39 0.11 0.07 0.89 0.45 0.02 Yes No
Population.7 (317,366] 33 8 41 0.12 0.07 0.72 0.54 0.03 Yes No
Population.8 (366,412] 26 13 39 0.09 0.11 1.01 -0.20 0.00 Yes No
Population.9 (412,467] 28 13 41 0.10 0.11 2.10 -0.09 0.00 Yes No
Population.10 (467,509] 23 16 39 0.08 0.14 0.75 -0.56 0.03 Yes No
Price.1 [24,87] 30 11 41 0.11 0.09 1.24 0.20 0.00 Yes No
Price.2 (87,96] 24 15 39 0.09 0.13 0.78 -0.37 0.01 Yes No
Price.3 (96,104] 31 17 48 0.11 0.14 2.54 -0.24 0.01 Yes No
Price.4 (104,110] 30 8 38 0.11 0.07 0.82 0.45 0.02 Yes No
Price.5 (110,117] 21 14 35 0.07 0.12 1.11 -0.54 0.03 Yes No
Price.6 (117,122] 29 11 40 0.10 0.09 0.60 0.10 0.00 Yes No
Price.7 (122,128] 26 13 39 0.09 0.11 1.64 -0.20 0.00 Yes No
Price.8 (128,134] 39 7 46 0.14 0.06 0.73 0.85 0.07 Yes No
Price.9 (134,146] 23 12 35 0.08 0.10 0.64 -0.22 0.00 Yes No
Price.10 (146,191] 29 10 39 0.10 0.08 1.16 0.22 0.00 Yes No
Age.1 [25,30] 27 16 43 0.10 0.14 1.48 -0.34 0.01 Yes No
Age.2 (30,36] 29 10 39 0.10 0.08 0.61 0.22 0.00 Yes No
Age.3 (36,43] 31 13 44 0.11 0.11 1.46 0.00 0.00 Yes No
Age.4 (43,48] 22 12 34 0.08 0.10 1.95 -0.22 0.00 Yes No
Age.5 (48,54] 29 11 40 0.10 0.09 0.82 0.10 0.00 Yes No
Age.6 (54,60] 30 15 45 0.11 0.13 1.11 -0.16 0.00 Yes No
Age.7 (60,64] 35 8 43 0.12 0.07 0.75 0.54 0.03 Yes No
Age.8 (64,70] 27 8 35 0.10 0.07 1.00 0.36 0.01 Yes No
Age.9 (70,76] 28 18 46 0.10 0.15 1.24 -0.40 0.02 Yes No
Age.10 (76,80] 24 7 31 0.09 0.06 0.68 0.41 0.01 Yes No
Education.1 [10,11] 67 29 96 0.24 0.25 1.11 -0.04 0.00 Yes No
Education.2 (11,12] 40 9 49 0.14 0.08 0.76 0.56 0.03 Yes No
Education.3 (12,13] 31 12 43 0.11 0.10 0.88 0.10 0.00 Yes No
Education.4 (13,14] 25 15 40 0.09 0.13 1.10 -0.37 0.01 Yes No
Education.5 (14,15] 26 10 36 0.09 0.08 0.67 0.11 0.00 Yes No
Education.6 (15,16] 32 15 47 0.11 0.13 1.09 -0.16 0.00 Yes No
Education.7 (16,17] 32 17 49 0.11 0.14 2.00 -0.24 0.01 Yes No
Education.8 (17,18] 29 11 40 0.10 0.09 0.96 0.10 0.00 Yes No
  • Variable – variable name
  • Target – target variable label
  • Class – name of the bin (or the category itself for a categorical variable)
  • Out_1 – number of observations in the class whose target equals Pclass (here Urban = Yes)
  • Out_0 – number of observations in the class with the other target value (here Urban = No)
  • TOTAL – total number of observations in the class
  • Per_1 – Out_1 / total number of target = Pclass observations
  • Per_0 – Out_0 / total number of observations with the other target value
  • Odds – Per_1 / Per_0
  • WOE – Weight of Evidence, calculated as ln(Per_1 / Per_0)
  • IV – Information Value, (Per_1 – Per_0) * ln(Per_1 / Per_0) [^IV]
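As a rough illustration of how these columns relate to each other, the sketch below builds them from a raw predictor and a binary target in base R. The function name `woe_iv` and the toy vectors are hypothetical; this is not SmartEDA's code, and it assumes both target classes occur and no cell count is zero (otherwise the logarithm is undefined).

```r
# Hypothetical helper: per-class WOE and IV for a categorical predictor x
# and a two-class target y. `positive` names the class counted as Out_1.
woe_iv <- function(x, y, positive) {
  counts <- table(x, y == positive)   # columns are "FALSE" and "TRUE"
  out_1 <- counts[, "TRUE"]           # target == positive
  out_0 <- counts[, "FALSE"]          # the other class
  per_1 <- out_1 / sum(out_1)         # share of all positives in this class
  per_0 <- out_0 / sum(out_0)         # share of all others in this class
  woe <- log(per_1 / per_0)
  data.frame(
    class = rownames(counts),
    out_1 = as.vector(out_1),
    out_0 = as.vector(out_0),
    per_1 = as.vector(per_1),
    per_0 = as.vector(per_0),
    woe   = as.vector(woe),
    iv    = as.vector((per_1 - per_0) * woe)
  )
}

# Toy data, made up for illustration
x <- c("a", "a", "a", "b", "b", "b", "b", "b")
y <- c("Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes")
res <- woe_iv(x, y, positive = "Yes")
print(res)
```

Each row's IV is non-negative because the two factors (Per_1 - Per_0) and ln(Per_1 / Per_0) always have the same sign.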

Understanding WOE and IV

Here is a worked example of computing WOE and IV (see note 1).

Take the variable ShelveLoc, which has three categories:

  • Bad
  • Good
  • Medium

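Take the category Bad. Within this subset there are

  • 74 observations with \(y = 1\) (Urban = Yes)
  • 22 observations with \(y = 0\) (Urban = No)

which makes 96 observations in total.

The overall counts of the two target classes are

Carseats$Urban %>% table()
## .
##  No Yes 
## 118 282

Therefore

\[\text{Per_1} = \frac{74}{282} = 0.2624\] \[\text{Per_0} = \frac{22}{118} = 0.1864\]

so

\[WOE = \log(\text{Per_1}) - \log(\text{Per_0}) = 0.3418\]

and

\[IV = (\text{Per_1} - \text{Per_0})[\log(\text{Per_1}) - \log(\text{Per_0})] = 0.0260\]

The ExpCatStat table above shows WOE = 0.31 and IV = 0.02 for this row, which is what the same formulas give when the rounded proportions 0.26 and 0.19 are plugged in.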

The IV of the other two categories is computed in the same way. Summing the three gives

\[IV_{\text{ShelveLoc}} = 0.026 + 0.006 + 0.003 = 0.035\]

This value can be checked against @ref(IV-tbl).
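The arithmetic for all three ShelveLoc categories can be reproduced with a few lines of base R. The counts are copied from the Out_1 and Out_0 columns of the ExpCatStat table above; the variable names are only illustrative.

```r
# Counts per ShelveLoc category, taken from the ExpCatStat output above
out_1 <- c(Bad = 74, Good = 57, Medium = 151)  # Urban == "Yes"
out_0 <- c(Bad = 22, Good = 28, Medium = 68)   # Urban == "No"

per_1 <- out_1 / sum(out_1)    # divide by 282
per_0 <- out_0 / sum(out_0)    # divide by 118
woe   <- log(per_1 / per_0)
iv    <- (per_1 - per_0) * woe

print(round(woe, 4))
print(round(iv, 4))
iv_total <- sum(iv)
print(round(iv_total, 3))
```

Working from unrounded proportions, the three contributions come out at about 0.026, 0.006 and 0.003, in line with the sum shown above.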

Function for computing IV

ExpCatStat(
    Carseats,
    Target = "Urban",
    result = "Stat",
    clim = 10,
    nlim = 5,
    Pclass = "Yes",
    plot = TRUE
) %>% 
    kable()

Variable Target Unique Chi-squared p-value df IV Value Cramers V Degree of Association Predictive Power
ShelveLoc Urban 3 2.738 0.254 2 0.03 0.08 Very Weak Not Predictive
US Urban 2 0.684 0.408 1 0.01 0.04 Very Weak Not Predictive
Sales Urban 10 6.696 0.669 9 0.09 0.13 Weak Somewhat Predictive
CompPrice Urban 10 4.543 0.872 9 0.03 0.11 Weak Not Predictive
Income Urban 10 8.428 0.492 9 0.08 0.15 Weak Not Predictive
Advertising Urban 7 5.565 0.474 6 0.06 0.12 Weak Not Predictive
Population Urban 10 10.560 0.307 9 0.14 0.16 Weak Somewhat Predictive
Price Urban 10 11.143 0.266 9 0.14 0.17 Weak Somewhat Predictive
Age Urban 10 8.414 0.493 9 0.08 0.15 Weak Not Predictive
Education Urban 8 5.122 0.645 7 0.05 0.11 Weak Not Predictive
  • If information value is < 0.03 then predictive power = “Not Predictive”
  • If information value is 0.03 to 0.1 then predictive power = “Somewhat Predictive”
  • If information value is 0.1 to 0.3 then predictive power = “Medium Predictive”
  • If information value is >0.3 then predictive power = “Highly Predictive”

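  1. WOE (for a definition, see the CSDN blog) resembles information entropy in that it also reflects heterogeneity: the more homogeneous a bin is, the larger its WOE. \[IV = WOE \times (p_T - p_N) = (\ln(p_T)-\ln(p_N)) \cdot (p_T - p_N)\] This is the contribution of a single level. A variable is split into several levels, so its IV is the sum \(\sum\) over them. IV is used instead of WOE because it takes sample size into account. WOE is the log of a ratio of two proportions, which is still a relative quantity and says nothing about how many observations the bin holds. The factor \(p_T - p_N\), by contrast, is built from the bin's share of each class in the whole sample, so it does reflect bin size. This makes IV the better measure: even when a bin's WOE is large, its IV stays small if the bin covers few observations.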
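A small numeric sketch of the point made in this note, using made-up proportions: two bins with the same ratio p_T / p_N have the same WOE, but the bin that holds a larger share of the sample contributes far more IV.

```r
# Made-up proportions: each bin's share of the positive (p_t) and negative (p_n) class
p_t <- c(small_bin = 0.02, large_bin = 0.40)
p_n <- c(small_bin = 0.01, large_bin = 0.20)

woe <- log(p_t / p_n)       # both equal log(2): WOE ignores bin size
iv  <- (p_t - p_n) * woe    # scaled by the difference in shares

print(round(woe, 4))
print(round(iv, 4))
```

The small bin contributes 0.01 × ln 2 and the large bin 0.20 × ln 2, twenty times as much, even though their WOE values are identical.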