SmartEDA
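The SmartEDA package [@SmartEDA] covers essentially every standard summary of a data table in a few function calls, which saves time; it is worth running once on every new data table. This post follows the vignette by @SmartEDAvignettes.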
knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache=T)
library(SmartEDA)
library(tidyverse)
library(ISLR)
library(formattable)
library(knitr)
library(DT)
Carseats <- ISLR::Carseats
Using the ExpData function
ExpData(data = Carseats, type = 1) %>% kable()
Descriptions | Obs |
---|---|
Sample size (Nrow) | 400 |
No. of Variables (Ncol) | 11 |
No. of Numeric Variables | 8 |
No. of Factor Variables | 3 |
No. of Text Variables | 0 |
No. of Logical Variables | 0 |
No. of Date Variables | 0 |
No. of Zero variance Variables (Uniform) | 0 |
%. of Variables having complete cases | 100% (11) |
%. of Variables having <50% missing cases | 0% (0) |
%. of Variables having >50% missing cases | 0% (0) |
%. of Variables having >90% missing cases | 0% (0) |
ExpData(data = Carseats, type = 2) %>% kable()
S.no | Variable Name | Variable Type | % of Missing | No. of Unique values |
---|---|---|---|---|
1 | Sales | numeric | 0 | 336 |
2 | CompPrice | numeric | 0 | 73 |
3 | Income | numeric | 0 | 98 |
4 | Advertising | numeric | 0 | 28 |
5 | Population | numeric | 0 | 275 |
6 | Price | numeric | 0 | 101 |
7 | ShelveLoc* | factor | 0 | 3 |
8 | Age | numeric | 0 | 56 |
9 | Education | numeric | 0 | 9 |
10 | Urban* | factor | 0 | 2 |
11 | US* | factor | 0 | 2 |
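type = 1 and type = 2 give two simple views of the data: type = 1 summarizes the dataset as a whole, and type = 2 lists each variable's type, missing rate, and number of unique values.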
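As a rough illustration of what the type = 2 view reports, the sketch below builds a similar per-variable table in base R. It is not SmartEDA's implementation; the helper name `overview` and the use of the built-in `iris` data are assumptions made for this example.

```r
# Hypothetical helper: a per-variable overview, loosely mirroring ExpData(type = 2)
overview <- function(df) {
  data.frame(
    variable    = names(df),
    type        = vapply(df, function(x) class(x)[1], character(1)),
    pct_missing = vapply(df, function(x) mean(is.na(x)) * 100, numeric(1)),
    n_unique    = vapply(df, function(x) length(unique(x)), integer(1)),
    row.names   = NULL
  )
}

ov <- overview(iris)  # iris ships with base R
ov
```

Because it uses only base R, the same helper can be run on any data frame before loading SmartEDA.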
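The remaining functions treat continuous and categorical variables separately. For each kind they produce tables and plots, either without a target variable or with one, adapting to the target's type.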
ExpNumStat(
Carseats,
by = "A",
gp = NULL,
Qnt = seq(0, 1, 0.1),
MesofShape = 2,
Outlier = TRUE,
round = 2
) %>%
mutate_at(vars(Per_of_Missing), percent) %>% kable()
Vname | Group | TN | nNeg | nZero | nPos | NegInf | PosInf | NA_Value | Per_of_Missing | sum | min | max | mean | median | SD | CV | IQR | Skewness | Kurtosis | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% | LB.25% | UB.75% | nOutliers |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Advertising | All | 400 | 0 | 144 | 256 | 0 | 0 | 0 | 0.00% | 2654.00 | 0 | 29.00 | 6.63 | 5.00 | 6.65 | 1.00 | 12.00 | 0.64 | -0.55 | 0 | 0.00 | 0.00 | 0.00 | 2.00 | 5.00 | 8.40 | 11.00 | 13.00 | 16.0 | 29.00 | -18.00 | 30.00 | 0 |
Age | All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 21329.00 | 25 | 80.00 | 53.32 | 54.50 | 16.20 | 0.30 | 26.25 | -0.08 | -1.14 | 25 | 30.00 | 36.00 | 43.00 | 48.60 | 54.50 | 60.00 | 64.00 | 70.00 | 76.0 | 80.00 | 0.38 | 105.38 | 0 |
CompPrice | All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 49990.00 | 77 | 175.00 | 124.97 | 125.00 | 15.33 | 0.12 | 20.00 | -0.04 | 0.03 | 77 | 106.00 | 112.80 | 117.00 | 121.00 | 125.00 | 130.00 | 133.00 | 137.20 | 145.0 | 175.00 | 85.00 | 165.00 | 2 |
Income | All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 27463.00 | 21 | 120.00 | 68.66 | 69.00 | 27.99 | 0.41 | 48.25 | 0.05 | -1.09 | 21 | 30.00 | 39.00 | 48.00 | 62.00 | 69.00 | 77.00 | 84.60 | 96.20 | 107.0 | 120.00 | -29.62 | 163.38 | 0 |
Population | All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 105936.00 | 10 | 509.00 | 264.84 | 272.00 | 147.38 | 0.56 | 259.50 | -0.05 | -1.20 | 10 | 58.90 | 110.40 | 160.00 | 218.60 | 272.00 | 317.80 | 366.00 | 412.20 | 467.0 | 509.00 | -250.25 | 787.75 | 0 |
Price | All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 46318.00 | 24 | 191.00 | 115.80 | 117.00 | 23.68 | 0.20 | 31.00 | -0.12 | 0.43 | 24 | 87.00 | 96.80 | 104.00 | 110.00 | 117.00 | 122.00 | 128.30 | 134.00 | 146.0 | 191.00 | 53.50 | 177.50 | 5 |
Sales | All | 400 | 0 | 1 | 399 | 0 | 0 | 0 | 0.00% | 2998.53 | 0 | 16.27 | 7.50 | 7.49 | 2.82 | 0.38 | 3.93 | 0.18 | -0.10 | 0 | 4.12 | 5.07 | 5.87 | 6.61 | 7.49 | 8.08 | 8.82 | 9.76 | 11.3 | 16.27 | -0.50 | 15.21 | 2 |
ExpNumStat(
Carseats,
by = "A",
gp = "Price",
Qnt = seq(0, 1, 0.1),
MesofShape = 2,
Outlier = TRUE,
round = 2
) %>%
mutate_at(vars(Per_of_Missing), percent) %>% kable()
Vname | Group | Note | TN | nNeg | nZero | nPos | NegInf | PosInf | NA_Value | Per_of_Missing | sum | min | max | mean | median | SD | CV | IQR | Skewness | Kurtosis | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% | LB.25% | UB.75% | nOutliers | cor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Advertising | Price | Cor b/w Price | 400 | 0 | 144 | 256 | 0 | 0 | 0 | 0.00% | 2654.00 | 0 | 29.00 | 6.63 | 5.00 | 6.65 | 1.00 | 12.00 | 0.64 | -0.55 | 0 | 0.00 | 0.00 | 0.00 | 2.00 | 5.00 | 8.40 | 11.00 | 13.00 | 16.0 | 29.00 | -18.00 | 30.00 | 0 | 0.04 |
Age | Price | Cor b/w Price | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 21329.00 | 25 | 80.00 | 53.32 | 54.50 | 16.20 | 0.30 | 26.25 | -0.08 | -1.14 | 25 | 30.00 | 36.00 | 43.00 | 48.60 | 54.50 | 60.00 | 64.00 | 70.00 | 76.0 | 80.00 | 0.38 | 105.38 | 0 | -0.10 |
CompPrice | Price | Cor b/w Price | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 49990.00 | 77 | 175.00 | 124.97 | 125.00 | 15.33 | 0.12 | 20.00 | -0.04 | 0.03 | 77 | 106.00 | 112.80 | 117.00 | 121.00 | 125.00 | 130.00 | 133.00 | 137.20 | 145.0 | 175.00 | 85.00 | 165.00 | 2 | 0.58 |
Income | Price | Cor b/w Price | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 27463.00 | 21 | 120.00 | 68.66 | 69.00 | 27.99 | 0.41 | 48.25 | 0.05 | -1.09 | 21 | 30.00 | 39.00 | 48.00 | 62.00 | 69.00 | 77.00 | 84.60 | 96.20 | 107.0 | 120.00 | -29.62 | 163.38 | 0 | -0.06 |
Population | Price | Cor b/w Price | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 105936.00 | 10 | 509.00 | 264.84 | 272.00 | 147.38 | 0.56 | 259.50 | -0.05 | -1.20 | 10 | 58.90 | 110.40 | 160.00 | 218.60 | 272.00 | 317.80 | 366.00 | 412.20 | 467.0 | 509.00 | -250.25 | 787.75 | 0 | -0.01 |
Price | Price | Cor b/w Price | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 46318.00 | 24 | 191.00 | 115.80 | 117.00 | 23.68 | 0.20 | 31.00 | -0.12 | 0.43 | 24 | 87.00 | 96.80 | 104.00 | 110.00 | 117.00 | 122.00 | 128.30 | 134.00 | 146.0 | 191.00 | 53.50 | 177.50 | 5 | 1.00 |
Sales | Price | Cor b/w Price | 400 | 0 | 1 | 399 | 0 | 0 | 0 | 0.00% | 2998.53 | 0 | 16.27 | 7.50 | 7.49 | 2.82 | 0.38 | 3.93 | 0.18 | -0.10 | 0 | 4.12 | 5.07 | 5.87 | 6.61 | 7.49 | 8.08 | 8.82 | 9.76 | 11.3 | 16.27 | -0.50 | 15.21 | 2 | -0.44 |
ExpNumStat(
Carseats,
by = "GA",
gp = "Urban",
Qnt = seq(0, 1, 0.1),
MesofShape = 2,
Outlier = TRUE,
round = 2
) %>% mutate_at(vars(Per_of_Missing), percent) %>% kable()
Vname | Group | TN | nNeg | nZero | nPos | NegInf | PosInf | NA_Value | Per_of_Missing | sum | min | max | mean | median | SD | CV | IQR | Skewness | Kurtosis | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% | LB.25% | UB.75% | nOutliers |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Advertising | Urban:All | 400 | 0 | 144 | 256 | 0 | 0 | 0 | 0.00% | 2654.00 | 0.00 | 29.00 | 6.63 | 5.00 | 6.65 | 1.00 | 12.00 | 0.64 | -0.55 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 5.00 | 8.40 | 11.00 | 13.00 | 16.00 | 29.00 | -18.00 | 30.00 | 0 |
Advertising | Urban:Yes | 282 | 0 | 94 | 188 | 0 | 0 | 0 | 0.00% | 1922.00 | 0.00 | 29.00 | 6.82 | 5.00 | 6.73 | 0.99 | 12.00 | 0.66 | -0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | 5.00 | 8.00 | 11.00 | 13.00 | 16.00 | 29.00 | -18.00 | 30.00 | 0 |
Advertising | Urban:No | 118 | 0 | 50 | 68 | 0 | 0 | 0 | 0.00% | 732.00 | 0.00 | 24.00 | 6.20 | 5.00 | 6.47 | 1.04 | 11.00 | 0.56 | -0.80 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 | 9.00 | 10.90 | 12.00 | 14.00 | 24.00 | -16.50 | 27.50 | 0 |
Age | Urban:All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 21329.00 | 25.00 | 80.00 | 53.32 | 54.50 | 16.20 | 0.30 | 26.25 | -0.08 | -1.14 | 25.00 | 30.00 | 36.00 | 43.00 | 48.60 | 54.50 | 60.00 | 64.00 | 70.00 | 76.00 | 80.00 | 0.38 | 105.38 | 0 |
Age | Urban:Yes | 282 | 0 | 0 | 282 | 0 | 0 | 0 | 0.00% | 15121.00 | 25.00 | 80.00 | 53.62 | 55.00 | 15.98 | 0.30 | 24.75 | -0.10 | -1.10 | 25.00 | 32.00 | 37.00 | 43.00 | 50.00 | 55.00 | 61.00 | 64.00 | 69.00 | 75.00 | 80.00 | 3.88 | 102.88 | 0 |
Age | Urban:No | 118 | 0 | 0 | 118 | 0 | 0 | 0 | 0.00% | 6208.00 | 25.00 | 80.00 | 52.61 | 53.50 | 16.76 | 0.32 | 28.00 | -0.01 | -1.20 | 25.00 | 29.00 | 36.00 | 41.00 | 47.00 | 53.50 | 57.00 | 63.90 | 71.60 | 76.00 | 80.00 | -3.00 | 109.00 | 0 |
CompPrice | Urban:All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 49990.00 | 77.00 | 175.00 | 124.97 | 125.00 | 15.33 | 0.12 | 20.00 | -0.04 | 0.03 | 77.00 | 106.00 | 112.80 | 117.00 | 121.00 | 125.00 | 130.00 | 133.00 | 137.20 | 145.00 | 175.00 | 85.00 | 165.00 | 2 |
CompPrice | Urban:Yes | 282 | 0 | 0 | 282 | 0 | 0 | 0 | 0.00% | 35429.00 | 77.00 | 175.00 | 125.63 | 125.00 | 15.70 | 0.12 | 20.00 | -0.04 | 0.14 | 77.00 | 106.00 | 113.00 | 117.00 | 122.00 | 125.00 | 130.00 | 133.70 | 138.00 | 145.90 | 175.00 | 86.00 | 166.00 | 3 |
CompPrice | Urban:No | 118 | 0 | 0 | 118 | 0 | 0 | 0 | 0.00% | 14561.00 | 88.00 | 154.00 | 123.40 | 123.00 | 14.36 | 0.12 | 20.25 | -0.10 | -0.47 | 88.00 | 106.00 | 111.00 | 115.10 | 120.80 | 123.00 | 128.00 | 132.00 | 135.00 | 142.30 | 154.00 | 82.88 | 163.88 | 0 |
Income | Urban:All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 27463.00 | 21.00 | 120.00 | 68.66 | 69.00 | 27.99 | 0.41 | 48.25 | 0.05 | -1.09 | 21.00 | 30.00 | 39.00 | 48.00 | 62.00 | 69.00 | 77.00 | 84.60 | 96.20 | 107.00 | 120.00 | -29.62 | 163.38 | 0 |
Income | Urban:Yes | 282 | 0 | 0 | 282 | 0 | 0 | 0 | 0.00% | 19555.00 | 21.00 | 120.00 | 69.34 | 70.00 | 27.27 | 0.39 | 47.00 | 0.01 | -1.07 | 21.00 | 32.00 | 41.00 | 52.00 | 63.00 | 70.00 | 79.00 | 87.00 | 96.00 | 105.00 | 120.00 | -26.50 | 161.50 | 0 |
Income | Urban:No | 118 | 0 | 0 | 118 | 0 | 0 | 0 | 0.00% | 7908.00 | 21.00 | 120.00 | 67.02 | 67.00 | 29.68 | 0.44 | 49.75 | 0.14 | -1.12 | 21.00 | 26.70 | 36.00 | 46.00 | 59.80 | 67.00 | 73.40 | 82.00 | 99.80 | 110.30 | 120.00 | -34.38 | 164.62 | 0 |
Population | Urban:All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 105936.00 | 10.00 | 509.00 | 264.84 | 272.00 | 147.38 | 0.56 | 259.50 | -0.05 | -1.20 | 10.00 | 58.90 | 110.40 | 160.00 | 218.60 | 272.00 | 317.80 | 366.00 | 412.20 | 467.00 | 509.00 | -250.25 | 787.75 | 0 |
Population | Urban:Yes | 282 | 0 | 0 | 282 | 0 | 0 | 0 | 0.00% | 73288.00 | 12.00 | 508.00 | 259.89 | 274.00 | 144.61 | 0.56 | 237.75 | -0.04 | -1.17 | 12.00 | 60.00 | 105.00 | 156.60 | 217.20 | 274.00 | 315.60 | 353.70 | 405.60 | 453.50 | 508.00 | -217.62 | 733.38 | 0 |
Population | Urban:No | 118 | 0 | 0 | 118 | 0 | 0 | 0 | 0.00% | 32648.00 | 10.00 | 509.00 | 276.68 | 271.00 | 153.76 | 0.56 | 264.00 | -0.09 | -1.27 | 10.00 | 56.10 | 122.40 | 176.50 | 219.60 | 271.00 | 344.00 | 400.00 | 440.20 | 480.30 | 509.00 | -252.00 | 804.00 | 0 |
Price | Urban:All | 400 | 0 | 0 | 400 | 0 | 0 | 0 | 0.00% | 46318.00 | 24.00 | 191.00 | 115.80 | 117.00 | 23.68 | 0.20 | 31.00 | -0.12 | 0.43 | 24.00 | 87.00 | 96.80 | 104.00 | 110.00 | 117.00 | 122.00 | 128.30 | 134.00 | 146.00 | 191.00 | 53.50 | 177.50 | 5 |
Price | Urban:Yes | 282 | 0 | 0 | 282 | 0 | 0 | 0 | 0.00% | 32857.00 | 24.00 | 191.00 | 116.51 | 118.00 | 23.94 | 0.21 | 30.75 | -0.23 | 0.55 | 24.00 | 87.00 | 97.00 | 104.30 | 110.00 | 118.00 | 124.00 | 129.00 | 134.00 | 146.90 | 191.00 | 54.88 | 177.88 | 4 |
Price | Urban:No | 118 | 0 | 0 | 118 | 0 | 0 | 0 | 0.00% | 13461.00 | 49.00 | 185.00 | 114.08 | 112.50 | 23.04 | 0.20 | 31.00 | 0.13 | 0.22 | 49.00 | 88.70 | 96.00 | 101.20 | 106.80 | 112.50 | 120.00 | 125.90 | 131.60 | 144.00 | 185.00 | 50.50 | 174.50 | 2 |
Sales | Urban:All | 400 | 0 | 1 | 399 | 0 | 0 | 0 | 0.00% | 2998.53 | 0.00 | 16.27 | 7.50 | 7.49 | 2.82 | 0.38 | 3.93 | 0.18 | -0.10 | 0.00 | 4.12 | 5.07 | 5.87 | 6.61 | 7.49 | 8.08 | 8.82 | 9.76 | 11.30 | 16.27 | -0.50 | 15.21 | 2 |
Sales | Urban:Yes | 282 | 0 | 0 | 282 | 0 | 0 | 0 | 0.00% | 2106.03 | 0.37 | 16.27 | 7.47 | 7.42 | 2.84 | 0.38 | 3.92 | 0.27 | -0.10 | 0.37 | 4.10 | 5.04 | 5.87 | 6.57 | 7.42 | 7.91 | 8.77 | 9.71 | 11.48 | 16.27 | -0.50 | 15.17 | 2 |
Sales | Urban:No | 118 | 0 | 1 | 117 | 0 | 0 | 0 | 0.00% | 892.50 | 0.00 | 14.90 | 7.56 | 7.67 | 2.81 | 0.37 | 3.90 | -0.03 | -0.06 | 0.00 | 4.30 | 5.14 | 5.84 | 6.87 | 7.67 | 8.49 | 8.98 | 9.77 | 11.11 | 14.90 | -0.40 | 15.18 | 0 |
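Several of the columns above are easy to reproduce by hand. The sketch below computes CV, IQR, and outlier bounds for one variable of the built-in `mtcars` data. It assumes that LB.25% and UB.75% are Tukey fences (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR); that agrees with the Advertising row (Q1 = 0, Q3 = 12, bounds -18 and 30), but it is inferred from the output rather than taken from the package source.

```r
x <- mtcars$mpg  # stand-in variable for this example

q   <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]                                # interquartile range
lower <- q[1] - 1.5 * iqr                         # assumed meaning of LB.25%
upper <- q[2] + 1.5 * iqr                         # assumed meaning of UB.75%

stats <- c(
  cv         = sd(x) / mean(x),                   # coefficient of variation as a ratio
  iqr        = iqr,
  lower      = lower,
  upper      = upper,
  n_outliers = sum(x < lower | x > upper)         # values outside the fences
)
round(stats, 2)
```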
- Vname: variable name
- Group: target variable
- TN: total sample size (including NA observations)
- nNeg: number of negative values
- nZero: number of zero values
- nPos: number of positive values
- NegInf: number of negative-infinity values
- PosInf: number of positive-infinity values
- NA_Value: number of missing (NA) values
- Per_of_Missing: percentage of missing values
- min: minimum value
- max: maximum value
- mean: average value
- median: median value
- SD: standard deviation
- CV: coefficient of variation, SD/mean (the table reports the ratio rather than a percentage; it is not a z-score)
- IQR: interquartile range, \(IQR = Q_3 - Q_1\)
- Qnt: the specified quantiles
- MesofShape: skewness and kurtosis
- Outlier: number of outliers
- Cor: correlation between the target and each independent variable
ExpNumViz(Carseats,
nlim = 10,
Page = c(2, 4),
sample = NULL)
## $`0`
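- nlim: numeric variables with fewer than 10 unique values are not plotted
- sample: number of variables to select at random for plotting; NULL plots all of them
ExpNumViz(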
Carseats,
target = "Price",
Page = c(2, 4)
)
## $`0`
ExpNumViz(
Carseats,
target = "Urban",
col = c("pink", "yellow", "orange"),
Page = c(2, 4)
)
## $`0`
ExpCTable(Carseats) %>% kable()
Variable | Valid | Frequency | Percent | CumPercent |
---|---|---|---|---|
ShelveLoc | Bad | 96 | 24.00 | 24.00 |
ShelveLoc | Good | 85 | 21.25 | 45.25 |
ShelveLoc | Medium | 219 | 54.75 | 100.00 |
ShelveLoc | TOTAL | 400 | NA | NA |
Urban | No | 118 | 29.50 | 29.50 |
Urban | Yes | 282 | 70.50 | 100.00 |
Urban | TOTAL | 400 | NA | NA |
US | No | 142 | 35.50 | 35.50 |
US | Yes | 258 | 64.50 | 100.00 |
US | TOTAL | 400 | NA | NA |
Education | 10 | 48 | 12.00 | 12.00 |
Education | 11 | 48 | 12.00 | 24.00 |
Education | 12 | 49 | 12.25 | 36.25 |
Education | 13 | 43 | 10.75 | 47.00 |
Education | 14 | 40 | 10.00 | 57.00 |
Education | 15 | 36 | 9.00 | 66.00 |
Education | 16 | 47 | 11.75 | 77.75 |
Education | 17 | 49 | 12.25 | 90.00 |
Education | 18 | 40 | 10.00 | 100.00 |
Education | TOTAL | 400 | NA | NA |
ExpCTable(
Carseats,
Target = "Price"
) %>% kable()
VARIABLE | CATEGORY | Price:(23.8,79.7] | Price:(79.7,135] | Price:(135,191] | TOTAL |
---|---|---|---|---|---|
ShelveLoc | Bad | 6 | 75 | 15 | 96 |
ShelveLoc | Good | 7 | 61 | 17 | 85 |
ShelveLoc | Medium | 11 | 170 | 38 | 219 |
ShelveLoc | TOTAL | 24 | 306 | 70 | 400 |
Urban | No | 7 | 90 | 21 | 118 |
Urban | Yes | 17 | 216 | 49 | 282 |
Urban | TOTAL | 24 | 306 | 70 | 400 |
US | No | 10 | 105 | 27 | 142 |
US | Yes | 14 | 201 | 43 | 258 |
US | TOTAL | 24 | 306 | 70 | 400 |
Education | 10 | 2 | 37 | 9 | 48 |
Education | 11 | 1 | 41 | 6 | 48 |
Education | 12 | 3 | 39 | 7 | 49 |
Education | 13 | 1 | 31 | 11 | 43 |
Education | 14 | 1 | 34 | 5 | 40 |
Education | 15 | 6 | 23 | 7 | 36 |
Education | 16 | 3 | 39 | 5 | 47 |
Education | 17 | 3 | 35 | 11 | 49 |
Education | 18 | 4 | 27 | 9 | 40 |
Education | TOTAL | 24 | 306 | 70 | 400 |
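With a numeric target such as Price, the column labels above (for example Price:(23.8,79.7]) look like three equal-width intervals of the kind base R's `cut()` produces. Assuming that is what happens, a comparable table can be sketched by hand; `mtcars`, and the choice of `cyl` and `mpg`, are stand-ins for this example.

```r
# Bin a numeric target into 3 equal-width intervals, then cross-tabulate
mpg_bin <- cut(mtcars$mpg, breaks = 3)
tab <- table(cyl = mtcars$cyl, mpg = mpg_bin)

addmargins(tab)  # append row and column totals, like the TOTAL entries above
```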
ExpCTable(
Carseats,
Target = "Urban"
) %>% kable()
VARIABLE | CATEGORY | Urban:No | Urban:Yes | TOTAL |
---|---|---|---|---|
ShelveLoc | Bad | 22 | 74 | 96 |
ShelveLoc | Good | 28 | 57 | 85 |
ShelveLoc | Medium | 68 | 151 | 219 |
ShelveLoc | TOTAL | 118 | 282 | 400 |
US | No | 46 | 96 | 142 |
US | Yes | 72 | 186 | 258 |
US | TOTAL | 118 | 282 | 400 |
Education | 10 | 14 | 34 | 48 |
Education | 11 | 15 | 33 | 48 |
Education | 12 | 9 | 40 | 49 |
Education | 13 | 12 | 31 | 43 |
Education | 14 | 15 | 25 | 40 |
Education | 15 | 10 | 26 | 36 |
Education | 16 | 15 | 32 | 47 |
Education | 17 | 17 | 32 | 49 |
Education | 18 | 11 | 29 | 40 |
Education | TOTAL | 118 | 282 | 400 |
ExpCTable(
mtcars,
Target = "gear",
margin = 1,
clim = 10,
nlim = 3,
bin = NULL,
per = FALSE
)
## VARIABLE CATEGORY gear:3 gear:4 gear:5 TOTAL
## 1 cyl 4 1 8 2 11
## 2 cyl 6 2 4 1 7
## 3 cyl 8 12 0 2 14
## 4 cyl TOTAL 15 12 5 32
## 5 vs 0 12 2 4 18
## 6 vs 1 3 10 1 14
## 7 vs TOTAL 15 12 5 32
## 8 am 0 15 4 0 19
## 9 am 1 0 8 5 13
## 10 am TOTAL 15 12 5 32
## 11 gear 3 15 0 0 15
## 12 gear 4 0 12 0 12
## 13 gear 5 0 0 5 5
## 14 gear TOTAL 15 12 5 32
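The counts in this output are plain cross-tabulations, so they can be checked against base R. The sketch below rebuilds the cyl rows with `table()`; it is a verification aid, not a description of how ExpCTable works internally.

```r
# Cross-tabulate cyl against the target gear
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

addmargins(tab)              # counts with row and column totals
prop.table(tab, margin = 1)  # row proportions, if percentages are wanted
```

The row for cyl = 4 reads 1, 8, 2 with a total of 11, matching the ExpCTable output.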
ExpCatViz(
Carseats,
target = NULL,
fname = NULL,
clim = 10,
margin = 2,
Page = c(2, 1),
sample = NULL
)
## $`0`
ExpCatViz(
Carseats,
target = "Urban",
fname = NULL,
clim = 10,
col = NULL,
margin = 2,
Page = c(2, 1),
sample = 2
)
## $`0`
Carseats %>% distinct(Urban)
## Urban
## 1 Yes
## 2 No
Carseats %>% names
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
ExpCatStat(
Carseats,
Target = "Urban",
result = "IV",
clim = 10,
nlim = 5,
Pclass = "Yes",
plot = TRUE
) %>% kable()
Variable | Class | Out_1 | Out_0 | TOTAL | Per_1 | Per_0 | Odds | WOE | IV | Ref_1 | Ref_0 |
---|---|---|---|---|---|---|---|---|---|---|---|
ShelveLoc.1 | Bad | 74 | 22 | 96 | 0.26 | 0.19 | 0.85 | 0.31 | 0.02 | Yes | No |
ShelveLoc.2 | Good | 57 | 28 | 85 | 0.20 | 0.24 | 0.81 | -0.19 | 0.01 | Yes | No |
ShelveLoc.3 | Medium | 151 | 68 | 219 | 0.54 | 0.58 | 1.55 | -0.07 | 0.00 | Yes | No |
US.1 | No | 96 | 46 | 142 | 0.34 | 0.39 | 1.24 | -0.14 | 0.01 | Yes | No |
US.2 | Yes | 186 | 72 | 258 | 0.66 | 0.61 | 0.81 | 0.08 | 0.00 | Yes | No |
Sales.1 | [0,4.11] | 29 | 11 | 40 | 0.10 | 0.09 | 1.29 | 0.10 | 0.00 | Yes | No |
Sales.2 | (4.11,5.05] | 29 | 11 | 40 | 0.10 | 0.09 | 0.75 | 0.10 | 0.00 | Yes | No |
Sales.3 | (5.05,5.86] | 26 | 14 | 40 | 0.09 | 0.12 | 0.75 | -0.29 | 0.01 | Yes | No |
Sales.4 | (5.86,6.59] | 30 | 10 | 40 | 0.11 | 0.08 | 0.59 | 0.32 | 0.01 | Yes | No |
Sales.5 | (6.59,7.49] | 32 | 9 | 41 | 0.11 | 0.08 | 1.44 | 0.32 | 0.01 | Yes | No |
Sales.6 | (7.49,8.07] | 30 | 9 | 39 | 0.11 | 0.08 | 1.55 | 0.32 | 0.01 | Yes | No |
Sales.7 | (8.07,8.8] | 24 | 16 | 40 | 0.09 | 0.14 | 1.29 | -0.45 | 0.02 | Yes | No |
Sales.8 | (8.8,9.71] | 26 | 14 | 40 | 0.09 | 0.12 | 0.75 | -0.29 | 0.01 | Yes | No |
Sales.9 | (9.71,11.3] | 26 | 14 | 40 | 0.09 | 0.12 | 1.11 | -0.29 | 0.01 | Yes | No |
Sales.10 | (11.3,16.3] | 30 | 10 | 40 | 0.11 | 0.08 | 1.11 | 0.32 | 0.01 | Yes | No |
CompPrice.1 | [77,106] | 30 | 13 | 43 | 0.11 | 0.11 | 1.39 | 0.00 | 0.00 | Yes | No |
CompPrice.2 | (106,112] | 22 | 15 | 37 | 0.08 | 0.13 | 1.05 | -0.48 | 0.02 | Yes | No |
CompPrice.3 | (112,117] | 34 | 11 | 45 | 0.12 | 0.09 | 1.23 | 0.29 | 0.01 | Yes | No |
CompPrice.4 | (117,121] | 24 | 13 | 37 | 0.09 | 0.11 | 0.95 | -0.20 | 0.00 | Yes | No |
CompPrice.5 | (121,125] | 33 | 15 | 48 | 0.12 | 0.13 | 1.24 | -0.08 | 0.00 | Yes | No |
CompPrice.6 | (125,130] | 29 | 10 | 39 | 0.10 | 0.08 | 0.91 | 0.22 | 0.00 | Yes | No |
CompPrice.7 | (130,133] | 25 | 11 | 36 | 0.09 | 0.09 | 0.75 | 0.00 | 0.00 | Yes | No |
CompPrice.8 | (133,137] | 26 | 9 | 35 | 0.09 | 0.08 | 1.33 | 0.11 | 0.00 | Yes | No |
CompPrice.9 | (137,145] | 30 | 12 | 42 | 0.11 | 0.10 | 0.58 | 0.10 | 0.00 | Yes | No |
CompPrice.10 | (145,175] | 29 | 9 | 38 | 0.10 | 0.08 | 0.96 | 0.22 | 0.00 | Yes | No |
Income.1 | [21,30] | 24 | 17 | 41 | 0.09 | 0.14 | 0.64 | -0.45 | 0.02 | Yes | No |
Income.2 | (30,39] | 30 | 11 | 41 | 0.11 | 0.09 | 1.33 | 0.20 | 0.00 | Yes | No |
Income.3 | (39,48] | 29 | 10 | 39 | 0.10 | 0.08 | 1.76 | 0.22 | 0.00 | Yes | No |
Income.4 | (48,62] | 28 | 15 | 43 | 0.10 | 0.13 | 1.44 | -0.26 | 0.01 | Yes | No |
Income.5 | (62,69] | 29 | 13 | 42 | 0.10 | 0.11 | 1.05 | -0.09 | 0.00 | Yes | No |
Income.6 | (69,77] | 25 | 10 | 35 | 0.09 | 0.08 | 0.93 | 0.11 | 0.00 | Yes | No |
Income.7 | (77,84] | 30 | 9 | 39 | 0.11 | 0.08 | 0.76 | 0.32 | 0.01 | Yes | No |
Income.8 | (84,96] | 32 | 8 | 40 | 0.11 | 0.07 | 1.24 | 0.45 | 0.02 | Yes | No |
Income.9 | (96,107] | 31 | 10 | 41 | 0.11 | 0.08 | 1.16 | 0.32 | 0.01 | Yes | No |
Income.10 | (107,120] | 24 | 15 | 39 | 0.09 | 0.13 | 0.55 | -0.37 | 0.01 | Yes | No |
Advertising.1 | [0,2] | 108 | 53 | 161 | 0.38 | 0.45 | 1.13 | -0.17 | 0.01 | Yes | No |
Advertising.2 | (2,5] | 37 | 8 | 45 | 0.13 | 0.07 | 1.40 | 0.62 | 0.04 | Yes | No |
Advertising.3 | (5,8] | 25 | 9 | 34 | 0.09 | 0.08 | 0.95 | 0.11 | 0.00 | Yes | No |
Advertising.4 | (8,11] | 37 | 20 | 57 | 0.13 | 0.17 | 0.74 | -0.27 | 0.01 | Yes | No |
Advertising.5 | (11,13] | 25 | 11 | 36 | 0.09 | 0.09 | 1.18 | 0.00 | 0.00 | Yes | No |
Advertising.6 | (13,16] | 26 | 8 | 34 | 0.09 | 0.07 | 2.08 | 0.25 | 0.00 | Yes | No |
Advertising.7 | (16,29] | 24 | 9 | 33 | 0.09 | 0.08 | 0.76 | 0.11 | 0.00 | Yes | No |
Population.1 | [10,58] | 26 | 14 | 40 | 0.09 | 0.12 | 0.57 | -0.29 | 0.01 | Yes | No |
Population.2 | (58,108] | 33 | 7 | 40 | 0.12 | 0.06 | 0.89 | 0.69 | 0.04 | Yes | No |
Population.3 | (108,160] | 29 | 12 | 41 | 0.10 | 0.10 | 0.82 | 0.00 | 0.00 | Yes | No |
Population.4 | (160,218] | 25 | 14 | 39 | 0.09 | 0.12 | 1.82 | -0.29 | 0.01 | Yes | No |
Population.5 | (218,272] | 28 | 13 | 41 | 0.10 | 0.11 | 1.70 | -0.09 | 0.00 | Yes | No |
Population.6 | (272,317] | 31 | 8 | 39 | 0.11 | 0.07 | 0.89 | 0.45 | 0.02 | Yes | No |
Population.7 | (317,366] | 33 | 8 | 41 | 0.12 | 0.07 | 0.72 | 0.54 | 0.03 | Yes | No |
Population.8 | (366,412] | 26 | 13 | 39 | 0.09 | 0.11 | 1.01 | -0.20 | 0.00 | Yes | No |
Population.9 | (412,467] | 28 | 13 | 41 | 0.10 | 0.11 | 2.10 | -0.09 | 0.00 | Yes | No |
Population.10 | (467,509] | 23 | 16 | 39 | 0.08 | 0.14 | 0.75 | -0.56 | 0.03 | Yes | No |
Price.1 | [24,87] | 30 | 11 | 41 | 0.11 | 0.09 | 1.24 | 0.20 | 0.00 | Yes | No |
Price.2 | (87,96] | 24 | 15 | 39 | 0.09 | 0.13 | 0.78 | -0.37 | 0.01 | Yes | No |
Price.3 | (96,104] | 31 | 17 | 48 | 0.11 | 0.14 | 2.54 | -0.24 | 0.01 | Yes | No |
Price.4 | (104,110] | 30 | 8 | 38 | 0.11 | 0.07 | 0.82 | 0.45 | 0.02 | Yes | No |
Price.5 | (110,117] | 21 | 14 | 35 | 0.07 | 0.12 | 1.11 | -0.54 | 0.03 | Yes | No |
Price.6 | (117,122] | 29 | 11 | 40 | 0.10 | 0.09 | 0.60 | 0.10 | 0.00 | Yes | No |
Price.7 | (122,128] | 26 | 13 | 39 | 0.09 | 0.11 | 1.64 | -0.20 | 0.00 | Yes | No |
Price.8 | (128,134] | 39 | 7 | 46 | 0.14 | 0.06 | 0.73 | 0.85 | 0.07 | Yes | No |
Price.9 | (134,146] | 23 | 12 | 35 | 0.08 | 0.10 | 0.64 | -0.22 | 0.00 | Yes | No |
Price.10 | (146,191] | 29 | 10 | 39 | 0.10 | 0.08 | 1.16 | 0.22 | 0.00 | Yes | No |
Age.1 | [25,30] | 27 | 16 | 43 | 0.10 | 0.14 | 1.48 | -0.34 | 0.01 | Yes | No |
Age.2 | (30,36] | 29 | 10 | 39 | 0.10 | 0.08 | 0.61 | 0.22 | 0.00 | Yes | No |
Age.3 | (36,43] | 31 | 13 | 44 | 0.11 | 0.11 | 1.46 | 0.00 | 0.00 | Yes | No |
Age.4 | (43,48] | 22 | 12 | 34 | 0.08 | 0.10 | 1.95 | -0.22 | 0.00 | Yes | No |
Age.5 | (48,54] | 29 | 11 | 40 | 0.10 | 0.09 | 0.82 | 0.10 | 0.00 | Yes | No |
Age.6 | (54,60] | 30 | 15 | 45 | 0.11 | 0.13 | 1.11 | -0.16 | 0.00 | Yes | No |
Age.7 | (60,64] | 35 | 8 | 43 | 0.12 | 0.07 | 0.75 | 0.54 | 0.03 | Yes | No |
Age.8 | (64,70] | 27 | 8 | 35 | 0.10 | 0.07 | 1.00 | 0.36 | 0.01 | Yes | No |
Age.9 | (70,76] | 28 | 18 | 46 | 0.10 | 0.15 | 1.24 | -0.40 | 0.02 | Yes | No |
Age.10 | (76,80] | 24 | 7 | 31 | 0.09 | 0.06 | 0.68 | 0.41 | 0.01 | Yes | No |
Education.1 | [10,11] | 67 | 29 | 96 | 0.24 | 0.25 | 1.11 | -0.04 | 0.00 | Yes | No |
Education.2 | (11,12] | 40 | 9 | 49 | 0.14 | 0.08 | 0.76 | 0.56 | 0.03 | Yes | No |
Education.3 | (12,13] | 31 | 12 | 43 | 0.11 | 0.10 | 0.88 | 0.10 | 0.00 | Yes | No |
Education.4 | (13,14] | 25 | 15 | 40 | 0.09 | 0.13 | 1.10 | -0.37 | 0.01 | Yes | No |
Education.5 | (14,15] | 26 | 10 | 36 | 0.09 | 0.08 | 0.67 | 0.11 | 0.00 | Yes | No |
Education.6 | (15,16] | 32 | 15 | 47 | 0.11 | 0.13 | 1.09 | -0.16 | 0.00 | Yes | No |
Education.7 | (16,17] | 32 | 17 | 49 | 0.11 | 0.14 | 2.00 | -0.24 | 0.01 | Yes | No |
Education.8 | (17,18] | 29 | 11 | 40 | 0.10 | 0.09 | 0.96 | 0.10 | 0.00 | Yes | No |
- Variable: variable name
- Target: target variable label
- Class: name of the bin (or the variable's value for a categorical variable)
- Out_1: number of observations in the positive class (here Urban = Yes, as set by Pclass)
- Out_0: number of observations in the other class
- TOTAL: number of observations in the category
- Per_1: Out_1 divided by the total number of positive-class observations
- Per_0: Out_0 divided by the total number of other-class observations
- Odds: Per_1 / Per_0
- WOE: Weight of Evidence, ln(Per_1 / Per_0)
- IV: Information Value, (Per_1 - Per_0) × ln(Per_1 / Per_0) [^IV]

Below is a worked example of computing WOE and IV.
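Take the variable ShelveLoc as an example. It has three categories: Bad, Good, and Medium.
Consider the category Bad. Within this subset there are 74 observations with Urban = Yes and 22 with Urban = No, so 96 observations in total.
Across the whole dataset, the class counts of the target are: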
Carseats$Urban %>% table()
## .
## No Yes
## 118 282
Therefore
\[\text{Per_1} = \frac{74}{282} = 0.2624113\] \[\text{Per_0} = \frac{22}{118} = 0.1864407\]
so
\[WOE = \log(\text{Per_1}) - \log(\text{Per_0}) = 0.3418\]
and
\[IV = (\text{Per_1} - \text{Per_0})[\log(\text{Per_1}) - \log(\text{Per_0})] = 0.0260\]
Computing the IV of the other two categories in the same way and summing gives
\[IV_{\text{ShelveLoc}} = 0.026 + 0.006 + 0.003 = 0.035\]
This value can be checked, up to rounding, against @ref(IV-tbl).
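The whole calculation fits in a few lines of base R. The sketch below starts from the ShelveLoc counts shown in the tables above (74 Yes and 22 No for Bad, and so on) and applies the WOE and IV definitions directly. The variable names are chosen for this example, and the unrounded results differ slightly from the ExpCatStat table, which appears to work from proportions rounded to two decimals.

```r
# Urban = Yes / No counts per ShelveLoc level, copied from the tables above
yes <- c(Bad = 74, Good = 57, Medium = 151)
no  <- c(Bad = 22, Good = 28, Medium = 68)

per_1 <- yes / sum(yes)         # share of all "Yes" rows falling in each level
per_0 <- no / sum(no)           # share of all "No" rows falling in each level
woe   <- log(per_1 / per_0)     # weight of evidence per level
iv    <- (per_1 - per_0) * woe  # information value contribution per level

round(data.frame(per_1, per_0, woe, iv), 4)
sum(iv)  # IV of ShelveLoc as a whole
```

Each term of `iv` is non-negative, because `per_1 - per_0` and `woe` always share the same sign.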
ExpCatStat(
Carseats,
Target = "Urban",
result = "Stat",
clim = 10,
nlim = 5,
Pclass = "Yes",
plot = TRUE
) %>%
kable()
Variable | Target | Unique | Chi-squared | p-value | df | IV Value | Cramers V | Degree of Association | Predictive Power |
---|---|---|---|---|---|---|---|---|---|
ShelveLoc | Urban | 3 | 2.738 | 0.254 | 2 | 0.03 | 0.08 | Very Weak | Not Predictive |
US | Urban | 2 | 0.684 | 0.408 | 1 | 0.01 | 0.04 | Very Weak | Not Predictive |
Sales | Urban | 10 | 6.696 | 0.669 | 9 | 0.09 | 0.13 | Weak | Somewhat Predictive |
CompPrice | Urban | 10 | 4.543 | 0.872 | 9 | 0.03 | 0.11 | Weak | Not Predictive |
Income | Urban | 10 | 8.428 | 0.492 | 9 | 0.08 | 0.15 | Weak | Not Predictive |
Advertising | Urban | 7 | 5.565 | 0.474 | 6 | 0.06 | 0.12 | Weak | Not Predictive |
Population | Urban | 10 | 10.560 | 0.307 | 9 | 0.14 | 0.16 | Weak | Somewhat Predictive |
Price | Urban | 10 | 11.143 | 0.266 | 9 | 0.14 | 0.17 | Weak | Somewhat Predictive |
Age | Urban | 10 | 8.414 | 0.493 | 9 | 0.08 | 0.15 | Weak | Not Predictive |
Education | Urban | 8 | 5.122 | 0.645 | 7 | 0.05 | 0.11 | Weak | Not Predictive |
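The Chi-squared, p-value, and Cramers V columns can be checked the same way. The sketch below runs `chisq.test()` on the ShelveLoc by Urban counts from the earlier cross table and derives Cramér's V from the usual formula, sqrt(chi2 / (n × (min(rows, cols) - 1))). Whether SmartEDA computes V exactly this way is an assumption.

```r
# ShelveLoc by Urban counts, copied from the ExpCTable output
counts <- matrix(
  c(22, 74,
    28, 57,
    68, 151),
  ncol = 2, byrow = TRUE,
  dimnames = list(ShelveLoc = c("Bad", "Good", "Medium"),
                  Urban     = c("No", "Yes"))
)

res  <- chisq.test(counts)    # Pearson chi-squared test of independence
chi2 <- unname(res$statistic)
n    <- sum(counts)
cramers_v <- sqrt(chi2 / (n * (min(dim(counts)) - 1)))

round(c(chi2 = chi2, p = res$p.value, V = cramers_v), 3)
```

This reproduces the ShelveLoc row: a chi-squared statistic of about 2.738 on 2 degrees of freedom, a p-value of about 0.254, and a Cramér's V of about 0.08.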