This ecosystem of packages is large and can feel intimidating, so we start with a small example to spark the reader's interest. Molnar (2018) gives a good example that focuses on using the iml and mlr packages. We use the Boston data from the MASS package.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
data("Boston", package = "MASS")
Boston %>% head
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
library("mlr")
# create an mlr task and model
tsk = makeRegrTask(data = Boston, target = "medv")
lrn = makeLearner("regr.randomForest", ntree = 100)
mod = train(lrn, tsk)
The target argument specifies the \(y\) variable. Next, load iml and wrap the model in a Predictor object, which holds the model together with the features and the response:
library(iml)
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(mod, data = X, y = Boston$medv)
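As a quick sanity check (a minimal sketch), the Predictor object can return predictions directly through its predict method:
# the wrapped model should reproduce the random forest's predictions
predictor$predict(X[1:3, ])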
theme_ilo <- function() {
theme_minimal() +
theme(
plot.subtitle = element_text(size = 9),
plot.caption = element_text(color = "gray30"),
plot.margin = unit(c(5, 10, 5, 10), units = "mm"),
# bold the x and y axis titles so they read well on slides
axis.title.x = element_text(size = 12, face = "bold"),
axis.title.y = element_text(size = 12, face = "bold"),
# 7 pt axis text is the sweet spot for readability
axis.text.x = element_text(size = 7, angle = 70, hjust = 1),
axis.text.y = element_text(size = 7),
legend.title = element_blank()
)
}
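FeatureImp below computes permutation feature importance: each feature is shuffled in turn, and the importance reported is the ratio of the shuffled error to the original error, here measured with MAE (compare the permutation.error column), so values near 1 mean a feature barely matters. For reference, a small sketch of the baseline error, assuming mlr's performance() helper and its mae measure:
# baseline MAE of the random forest on the training data;
# the importance ratios below report how much shuffling inflates it
performance(predict(mod, task = tsk), measures = mae)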
imp = FeatureImp$new(predictor, loss = "mae")
imp$results %>% as_tibble()
## # A tibble: 13 x 5
## feature importance.05 importance importance.95 permutation.error
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 lstat 3.83 4.77 5.81 4.56
## 2 rm 3.14 3.30 4.19 3.16
## 3 nox 1.71 1.76 2.04 1.68
## 4 crim 1.62 1.76 1.90 1.68
## 5 ptratio 1.72 1.74 1.94 1.67
## 6 dis 1.60 1.65 1.67 1.58
## 7 indus 1.36 1.46 1.52 1.39
## 8 tax 1.30 1.39 1.45 1.33
## 9 age 1.27 1.36 1.51 1.30
## 10 black 1.20 1.30 1.35 1.25
## 11 rad 1.07 1.15 1.19 1.10
## 12 zn 1.02 1.07 1.12 1.03
## 13 chas 0.999 1.06 1.13 1.02
plot(imp) +
labs(
title = "Feature Importances",
subtitle = "median value of owner-occupied homes in $1000s.",
caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
) +
theme_ilo()
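A small sketch with ordinary dplyr on the results above, keeping only the features whose permutation inflates the error by at least 50%:
# features whose shuffling raises MAE by 50% or more
imp$results %>%
  filter(importance > 1.5) %>%
  pull(feature)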
Take the lstat variable as an example. Holding all other variables fixed, we vary the value of lstat and observe its marginal effect on \(\hat y\). Because the samples differ from one another, each line here represents one sample, and how tightly the lines cluster shows how the samples are distributed along lstat. The yellow line is the average over all samples. Clearly, lstat and \(\hat y\) are negatively correlated.
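Formally, the yellow curve is the partial dependence function, the pointwise average of the individual curves:
\[
\hat f_{lstat}(x) = \frac{1}{n} \sum_{i=1}^{n} \hat f\big(x, x^{(i)}_{-lstat}\big),
\]
where \(x^{(i)}_{-lstat}\) denotes the observed values of all other features for sample \(i\).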
pdp.obj <- Partial$new(predictor, feature = "lstat")
plot(pdp.obj) +
labs(
title = "Partial dependence",
subtitle = "lower status of the population (percent).",
caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
) +
theme_ilo()
pdp.obj$set.feature("rm")
pdp.obj$center(min(Boston$rm))
plot(pdp.obj) +
labs(
title = "Partial dependence",
subtitle = "average number of rooms per dwelling.",
caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
) +
theme_ilo()
With set.feature the curve can be recomputed for another variable directly, and centering it (here at the minimum of rm) makes the curves easier to compare.
To make the model easier to read, we can also approximate it with a simpler, interpretable one; this falls into the category of surrogate models.
tree <- TreeSurrogate$new(predictor, maxdepth = 2)
plot(tree) +
labs(
title = "Decision Tree on medv",
subtitle = "use lstat and rm",
caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
) +
theme_ilo()
tree$predict(Boston) %>% head
## .y.hat
## 1 27.08021
## 2 27.08021
## 3 27.08021
## 4 27.08021
## 5 27.08021
## 6 27.08021
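How faithful is the surrogate? A minimal sketch comparing the tree's predictions with the random forest's, using only the objects defined above:
# squared correlation between surrogate and black-box predictions
rf_pred <- predictor$predict(X)[[1]]
tree_pred <- tree$predict(Boston)$.y.hat
cor(rf_pred, tree_pred)^2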
The surrogate model above sacrifices fidelity to the original model in order to reduce complexity. Local models, in contrast, explain the prediction for a single observation.
lime.explain <- LocalModel$new(predictor, x.interest = X[1,])
lime.explain$results
## beta x.recoded effect x.original feature feature.value
## rm 4.1929976 6.575 27.568959 6.575 rm rm=6.575
## ptratio -0.5580217 15.300 -8.537732 15.3 ptratio ptratio=15.3
## lstat -0.4470613 4.980 -2.226365 4.98 lstat lstat=4.98
Note that x.interest must contain exactly one row; passing two rows fails with the error 'x.interest' failed: Must have exactly 1 rows, but has 2 rows. Evidently, the prediction for observation 1 can be explained with just three features.
plot(lime.explain) +
labs(
title = "Local model",
subtitle = "on obs #1, here bars represent beta value",
caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
) +
theme_ilo()
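Because x.interest is limited to one row, explaining several observations means looping; a minimal sketch with base R's lapply:
# explain the first three observations one at a time
local_list <- lapply(1:3, function(i) {
  LocalModel$new(predictor, x.interest = X[i, ])$results
})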
shapley <- Shapley$new(predictor, x.interest = X[1,])
shapley$results
## feature phi phi.var feature.value
## 1 crim -0.37508953 1.38479468 crim=0.00632
## 2 zn 0.01429238 0.05192473 zn=18
## 3 indus 0.75291481 1.11188271 indus=2.31
## 4 chas -0.03490917 0.02166565 chas=0
## 5 nox -0.25336473 0.59191256 nox=0.538
## 6 rm -0.94337577 17.65291459 rm=6.575
## 7 age -0.19079689 0.23419779 age=65.2
## 8 dis -0.36065000 2.48832700 dis=4.09
## 9 rad -0.37845033 0.09221521 rad=1
## 10 tax -0.31932799 0.53351829 tax=296
## 11 ptratio 0.66885776 1.41510262 ptratio=15.3
## 12 black -0.10135112 0.26673910 black=396.9
## 13 lstat 3.35425803 11.82283229 lstat=4.98
plot(shapley) +
labs(
title = "Game theory",
subtitle = "on obs #1, here bars represent shapley value",
caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
) +
theme_ilo()
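Shapley values satisfy the efficiency property: the phi values sum, approximately (iml estimates them by sampling, so the error can be sizable at default settings), to the difference between this observation's prediction and the average prediction. A quick check, reusing the objects defined above:
# efficiency: sum of phis should be close to prediction(obs 1) minus the mean prediction
sum(shapley$results$phi)
predictor$predict(X[1, ])[[1]] - mean(predictor$predict(X)[[1]])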
shapley$explain(x.interest = X[2,])
plot(shapley) +
labs(
title = "Game theory",
subtitle = "on obs #2, here bars represent shapley value",
caption = "by Jiaxiang Li - jiaxiangli.netlify.com"
) +
theme_ilo()
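To compare explanations across observations side by side, a small sketch that collects the per-observation results into one data frame (everything used here appears above):
# gather Shapley explanations for the first two observations
shap_both <- lapply(1:2, function(i) {
  s <- Shapley$new(predictor, x.interest = X[i, ])
  transform(s$results, obs = i)
}) %>%
  bind_rows()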
Molnar, Christoph. 2018. “Interpretable Machine Learning with iml and mlr.” https://mp.weixin.qq.com/s/jjGHGBD07OkPeN7hFLf7qQ.