See Liu and Maldonado (2018, 116–28).

For many practical applications, for instance credit card fraud or intrusion detection, we often lack such labeled data.

Traditionally, fraud detection systems rely on manually engineered features created by subject-matter experts, working either directly with financial institutions or with specialized software vendors. One of the biggest challenges in fraud detection is the availability of labeled datasets, which are often hard or even impossible to come by.

The data problem in fraud detection is that the proportion of fraudulent users is very low. From a traditional supervised-learning perspective, this calls for sampling techniques; network analysis is another option.
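
As an aside, here is a minimal sketch of what the sampling route could look like: random undersampling of the majority class with dplyr. The 10:1 ratio is an arbitrary choice for illustration, and this is not the approach taken below.

library(dplyr)
set.seed(42)
# Keep all frauds, downsample the genuine transactions to a 10:1 ratio
fraud    <- df %>% filter(Class == 1)
genuine  <- df %>% filter(Class == 0) %>% sample_n(10 * nrow(fraud))
balanced <- bind_rows(fraud, genuine)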

Instead of relying on the label, we will treat the problem as an anomaly detection or outlier detection problem and use autoencoders, as before.

But from an outlier-analysis perspective, treating fraudulent users as anomalies is also a productive way to attack the problem.

Our first fraud example comes from a dataset made public on Kaggle (https://www.kaggle.com/dalpozz/creditcardfraud) by researchers from the Université Libre de Bruxelles in Belgium (for the full work, see their paper: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi, "Calibrating Probability with Undersampling for Unbalanced Classification," in Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015).

So the data we use come from a paper whose authors' goal was to introduce a technique for handling imbalanced data. The original download link is dead; use this one: www.kaggle.com

It contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the authors cannot provide the original features or more background information about the data. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are Time and Amount.

The data have been anonymized: the original features were replaced with PCA principal components, which prevents users from doing feature engineering based on what the features mean.
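
For intuition, this is roughly what such a transform looks like in R; a toy sketch with prcomp on made-up data, not the authors' actual preprocessing:

# Project 5 made-up features onto their principal components
pca <- prcomp(matrix(rnorm(100 * 5), ncol = 5), center = TRUE, scale. = TRUE)
head(pca$x)  # the component scores, analogous to V1, ..., V28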

library(glue)
library(tidyverse)
library(here)
df <- 
    glue("datasets/creditcard_{0:9}.rds") %>%  # the dataset was saved as ten .rds chunks
    here() %>% 
    map(read_rds) %>% 
    bind_rows()
df %>% dim
## [1] 284807     31
set.seed(123)
idxs <- sample(nrow(df), size=round(0.1*nrow(df)))
train <- df[-idxs,]
test <- df[idxs,]
y_train <- train$Class
y_test <- test$Class
X_train <- train %>% select(-one_of(c("Time","Class")))
X_test <- test %>% select(-one_of(c("Time","Class")))
# Coerce the data frame to matrix to perform the training
X_train <- as.matrix(X_train)
X_test <- as.matrix(X_test)

The advantage of one_of() is that when the dataset does not contain one of the named variables, it still drops the ones that are present without throwing an error.
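
A toy illustration (the column names here are made up): one_of() only warns when a column is absent, whereas naming the column directly would error.

library(dplyr)
toy <- data.frame(a = 1:3, b = 4:6)
toy %>% select(-one_of(c("a", "Missing")))  # warning only; drops a, keeps b
# toy %>% select(-a, -Missing)              # error: column `Missing` not found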

library(keras)
input_dim <- 29  # 29 features after dropping Time and Class
outer_layer_dim <- 14
inner_layer_dim <- 7
input_layer <- layer_input(shape = c(input_dim))
# Encoder: 29 -> 14 -> 7, with ReLU activations
encoder <- layer_dense(units = outer_layer_dim, activation = 'relu')(input_layer)
encoder <- layer_dense(units = inner_layer_dim, activation = 'relu')(encoder)
# Decoder: three dense layers of sizes 7, 14, and 29 (default linear activations)
decoder <- layer_dense(units = inner_layer_dim)(encoder)
decoder <- layer_dense(units = outer_layer_dim)(decoder)
decoder <- layer_dense(units = input_dim)(decoder)
autoencoder <- keras_model(inputs = input_layer, outputs = decoder)
autoencoder %>% 
    compile(
        optimizer = 'adam',
        loss = 'mean_squared_error',
        metrics = c('accuracy')
    )
history <- 
    autoencoder %>% 
    fit(
        X_train,X_train,
        epochs = 10, batch_size = 32,
        validation_split=0.2
    )

Training runs quite fast, no slower than XGBoost and the like.

autoencoder %>% 
    save_model_hdf5(here("datasets/creditcard_autoencoders_model.h5"))
history %>% 
    write_rds(here("datasets/creditcard_autoencoders_history.rds"))
library(keras)
autoencoder <- load_model_hdf5(here("datasets/creditcard_autoencoders_model.h5"))
# history <- read_rds(here("datasets/creditcard_autoencoders_history.rds"))
autoencoder
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## input_8 (InputLayer)             (None, 29)                    0           
## ___________________________________________________________________________
## dense_21 (Dense)                 (None, 14)                    420         
## ___________________________________________________________________________
## dense_22 (Dense)                 (None, 7)                     105         
## ___________________________________________________________________________
## dense_23 (Dense)                 (None, 7)                     56          
## ___________________________________________________________________________
## dense_24 (Dense)                 (None, 14)                    112         
## ___________________________________________________________________________
## dense_25 (Dense)                 (None, 29)                    435         
## ===========================================================================
## Total params: 1,128
## Trainable params: 1,128
## Non-trainable params: 0
## ___________________________________________________________________________
# plot(history)
# Reconstruct the test set
preds <- autoencoder %>% predict(X_test)
preds <- as.data.frame(preds)
# Per-row reconstruction error, scaled (divided by 30) and capped at 1,
# so the score lies in [0, 1]
y_preds <- pmin(rowSums((preds - X_test)^2) / 30, 1)

The logic here: if the reconstruction differs too much from the true values, the score is capped at 1; otherwise the scaled error itself is kept, yielding a probability-like score between 0 and 1. We then validate how accurate this score is as a predictor, that is, of the observations we deem anomalous relative to the sample distribution, how many are actually fraudulent users.
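
If hard labels are needed, the score can simply be thresholded; a quick sketch (the 0.5 cutoff and the name y_hat are arbitrary choices for illustration):

y_hat <- as.integer(y_preds > 0.5)
table(predicted = y_hat, actual = y_test)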

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred <- prediction(y_preds, y_test)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
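
The area under this curve can be pulled from the same prediction object as a single summary number:

auc <- performance(pred, measure = "auc")@y.values[[1]]
auc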

As the ROC curve shows, the performance is actually quite decent.

Liu, Yuxi (Hayden), and Pablo Maldonado. 2018. R Deep Learning Projects: Master the Techniques to Design and Develop Neural Network Models in R. Packt Publishing.