This integration package is large and can feel intimidating, so we start with a small example to spark interest. Molnar (2018) gives a good example that focuses on using the two packages iml and mlr together. We first prepare the Boston data from the MASS package.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
data("Boston", package  = "MASS")
Boston %>% head
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
library("mlr")

# create an mlr task and model
tsk = makeRegrTask(data = Boston, target = "medv")
lrn = makeLearner("regr.randomForest", ntree = 100)
mod = train(lrn, tsk)
library(iml)
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(mod, data = X, y = Boston$medv)
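As a quick sanity check on the wrapped model, the `Predictor` object can score new rows directly (a hedged sketch; `predict` here is iml's `Predictor` method, which returns a one-column data frame of predictions):

```r
# Sanity check: the Predictor wrapper should return one prediction per row.
# Exact values depend on the random forest fit above.
predictor$predict(X[1:3, ])
```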

Feature importance

theme_ilo <- function() {
  theme_minimal() +
    theme(
      plot.subtitle = element_text(size = 9),
      plot.caption = element_text(color = "gray30"),
      plot.margin = unit(c(5, 10, 5, 10), units = "mm"),
      # bold axis titles so they stand out, e.g. on slides
      axis.title.x = element_text(size = 12, face = "bold"),
      axis.title.y = element_text(size = 12, face = "bold"),
      # 7 pt axis text is small enough to avoid label overlap
      axis.text.x = element_text(size = 7, angle = 70, hjust = 1),
      axis.text.y = element_text(size = 7),
      legend.title = element_blank()
    )
}
imp = FeatureImp$new(predictor, loss = "mae")
imp$results %>% as_tibble()
## # A tibble: 13 x 5
##    feature importance.05 importance importance.95 permutation.error
##    <chr>           <dbl>      <dbl>         <dbl>             <dbl>
##  1 lstat           3.83        4.77          5.81              4.56
##  2 rm              3.14        3.30          4.19              3.16
##  3 nox             1.71        1.76          2.04              1.68
##  4 crim            1.62        1.76          1.90              1.68
##  5 ptratio         1.72        1.74          1.94              1.67
##  6 dis             1.60        1.65          1.67              1.58
##  7 indus           1.36        1.46          1.52              1.39
##  8 tax             1.30        1.39          1.45              1.33
##  9 age             1.27        1.36          1.51              1.30
## 10 black           1.20        1.30          1.35              1.25
## 11 rad             1.07        1.15          1.19              1.10
## 12 zn              1.02        1.07          1.12              1.03
## 13 chas            0.999       1.06          1.13              1.02
plot(imp) +
  labs(
    title = "Feature Importances",
    subtitle = "median value of owner-occupied homes in $1000s.",
    caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
  ) +
  theme_ilo()
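The importance scores above are ratios of the permuted loss to the original loss, so they depend on the loss chosen. A hedged sketch switching to squared error (same `FeatureImp` API; feature ranks may shift because `mse` penalises large errors more heavily than `mae` does):

```r
# Recompute permutation importance under squared-error loss; compare
# the resulting ranking with the mae-based table above.
imp_mse <- FeatureImp$new(predictor, loss = "mse")
imp_mse$results %>% as_tibble()
```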

Marginal effects

Take the lstat feature as an example: holding all other features fixed, we vary lstat and observe its marginal effect on \(\hat y\). Because observations differ, each line represents one observation, and how tightly the lines cluster shows how the observations are distributed over lstat. The yellow line is the average across observations. Clearly, lstat and \(\hat y\) are negatively related.

pdp.obj <- Partial$new(predictor, feature = "lstat")
plot(pdp.obj) +
  labs(
    title = "Partial dependence",
    subtitle = "lower status of the population (percent).",
    caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
  ) +
  theme_ilo()

pdp.obj$set.feature("rm")
pdp.obj$center(min(Boston$rm))
plot(pdp.obj) +
    labs(
    title = "Partial dependence",
    subtitle = "average number of rooms per dwelling.",
    caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
  ) +
  theme_ilo()

Here we can switch the feature in place and center the curves, which makes comparison easier.

Prediction with a simple decision tree

To improve interpretability, we approximate the model with a simple decision tree. This falls in the category of surrogate models.

tree <- TreeSurrogate$new(predictor, maxdepth = 2)
plot(tree) +
  labs(
    title = "Decision Tree on medv",
    subtitle = "use lstat and rm",
    caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
  ) +
  theme_ilo()

tree$predict(Boston) %>% head
##     .y.hat
## 1 27.08021
## 2 27.08021
## 3 27.08021
## 4 27.08021
## 5 27.08021
## 6 27.08021
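It is worth asking how much fidelity the two-split tree gives up. A hedged check (assuming the `r.squared` field documented for iml's `TreeSurrogate`, plus a direct comparison of predictions):

```r
# R^2 of the surrogate tree's predictions against the random forest's:
# closer to 1 means the simple tree mimics the black box well.
tree$r.squared

# Side-by-side predictions for the first few observations.
head(data.frame(
  forest = predictor$predict(X)[[1]],
  tree   = tree$predict(X)$.y.hat
))
```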

Explaining a single observation with a local model

The surrogate model above trades explanatory power for lower complexity. Here a local model explains the prediction for a single observation instead.

lime.explain <- LocalModel$new(predictor, x.interest = X[1,])
lime.explain$results
##               beta x.recoded    effect x.original feature feature.value
## rm       4.1929976     6.575 27.568959      6.575      rm      rm=6.575
## ptratio -0.5580217    15.300 -8.537732       15.3 ptratio  ptratio=15.3
## lstat   -0.4470613     4.980 -2.226365       4.98   lstat    lstat=4.98
## 'x.interest' failed: Must have exactly 1 rows, but has 2 rows.

Here x.interest must be a single row; that is a restriction of the function. Clearly, three features are enough to explain the prediction for observation 1.
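Because of the one-row restriction, explaining several observations means fitting one local model per row. A hedged sketch (the `k` argument, per the `LocalModel` docs, controls how many features the sparse local model keeps):

```r
# Fit one local surrogate per observation and collect the beta tables.
explain_rows <- function(rows, k = 3) {
  lapply(rows, function(i) {
    LocalModel$new(predictor, x.interest = X[i, ], k = k)$results
  })
}
local_betas <- explain_rows(1:3)
```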

plot(lime.explain) +
  labs(
    title = "Local model",
    subtitle = "on obs #1, here bars represent beta value",
    caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
  ) +
  theme_ilo()

Explaining a single observation with game theory

shapley <- Shapley$new(predictor, x.interest = X[1,])
shapley$results
##    feature         phi     phi.var feature.value
## 1     crim -0.37508953  1.38479468  crim=0.00632
## 2       zn  0.01429238  0.05192473         zn=18
## 3    indus  0.75291481  1.11188271    indus=2.31
## 4     chas -0.03490917  0.02166565        chas=0
## 5      nox -0.25336473  0.59191256     nox=0.538
## 6       rm -0.94337577 17.65291459      rm=6.575
## 7      age -0.19079689  0.23419779      age=65.2
## 8      dis -0.36065000  2.48832700      dis=4.09
## 9      rad -0.37845033  0.09221521         rad=1
## 10     tax -0.31932799  0.53351829       tax=296
## 11 ptratio  0.66885776  1.41510262  ptratio=15.3
## 12   black -0.10135112  0.26673910   black=396.9
## 13   lstat  3.35425803 11.82283229    lstat=4.98
plot(shapley) +
  labs(
    title = "Game theory",
    subtitle = "on obs #1, here bars represent shapley value",
    caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
  ) +
  theme_ilo()
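Shapley values have a useful bookkeeping property: the phi values for one observation should sum (approximately, since iml estimates them by sampling) to the gap between that observation's prediction and the average prediction. A hedged sketch:

```r
# The Shapley decomposition should roughly satisfy:
#   sum(phi) ~ f(x_1) - mean(f(X))
pred_obs1 <- predictor$predict(X[1, ])[[1]]
avg_pred  <- mean(predictor$predict(X)[[1]])
sum(shapley$results$phi)   # compare with pred_obs1 - avg_pred
```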

shapley$explain(x.interest = X[2,])
plot(shapley) +
  labs(
    title = "Game theory",
    subtitle = "on obs #2, here bars represent shapley value",
    caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
  ) +
  theme_ilo()

Molnar, Christoph. 2018. “Interpretable Machine Learning with Iml and Mlr.” 2018. https://mp.weixin.qq.com/s/jjGHGBD07OkPeN7hFLf7qQ.