See Liu and Maldonado (2018, 103–8).

library(keras)
library(MASS)
library(tidyverse)
library(here)
# Simulate 10,000 points from a bivariate standard normal (identity covariance)
Sigma <- matrix(c(1, 0, 0, 1), 2, 2)
n_points <- 10000
df <- mvrnorm(n = n_points, rep(0, 2), Sigma)
df <- as.data.frame(df)

Real-world fraud rates are roughly 1–5%, so we replace 1% of our samples with points drawn from a bivariate normal distribution with mean (5, 5) and standard deviation 1 in each coordinate (the same identity covariance).

Create the fraudulent data.

# Set the outliers
n_outliers <- as.integer(0.01*n_points)
idxs <- sample(n_points,size = n_outliers)
outliers <- mvrnorm(n=n_outliers, rep(5,2), Sigma)
df[idxs,] <- outliers
df %>% write_rds(here("datasets/sim_auto_encoders_data.rds"))
df <- read_rds(here("datasets/sim_auto_encoders_data.rds"))

mvrnorm produces one or more samples from the specified multivariate normal distribution.

On its own, that description of mvrnorm is not very clear.
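To make its behavior concrete, here is a small sanity check (not from the book; the seed and sample size are arbitrary): the empirical mean and covariance of a large draw should be close to the mu and Sigma we pass in.

# Illustrative only: verify mvrnorm returns samples with roughly the requested
# mean vector and covariance matrix
set.seed(42)
demo <- mvrnorm(n = 5000, mu = c(0, 0), Sigma = Sigma)
colMeans(demo)  # each entry close to 0
cov(demo)       # close to the 2x2 identity matrix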

plot(df)

# Encoder compresses the 2-D input to a 1-D code; the decoder maps it back to 2-D
input_layer <- layer_input(shape = c(2))
encoder <- layer_dense(units = 1, activation = 'relu')(input_layer)
decoder <- layer_dense(units = 2)(encoder)
autoencoder <- keras_model(inputs = input_layer, outputs = decoder)
autoencoder
Model
_________________________________________________________________________________________________________________
Layer (type)                                      Output Shape                                  Param #          
=================================================================================================================
input_3 (InputLayer)                              (None, 2)                                     0                
_________________________________________________________________________________________________________________
dense_5 (Dense)                                   (None, 1)                                     3                
_________________________________________________________________________________________________________________
dense_6 (Dense)                                   (None, 2)                                     4                
=================================================================================================================
Total params: 7
Trainable params: 7
Non-trainable params: 0
_________________________________________________________________________________________________________________
autoencoder %>% 
    compile(
        optimizer='adam',
        loss='mean_squared_error',
        metrics=c('accuracy')
    )
# Coerce the dataframe to matrix to perform the training
df <- as.matrix(df)
history <- 
    autoencoder %>% 
    fit(
        df,
        df,
        epochs = 30, 
        batch_size = 128,
        validation_split = 0.2
    )
history %>% 
    write_rds(here("datasets/sim_autoencoders_results.rds"))
autoencoder %>% 
    save_model_hdf5(here("datasets/sim_autoencoders_mode.h5"))
history <- read_rds(here("datasets/sim_autoencoders_results.rds"))
plot(history)

By around epoch 13, there are already signs of overfitting.
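To pin this down rather than eyeballing the plot, the per-epoch metrics stored on the history object can be inspected directly (a small sketch, assuming the usual keras_training_history structure with a metrics element):

# Find the epoch where validation loss bottoms out
metrics <- as.data.frame(history$metrics)
which.min(metrics$val_loss)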

autoencoder <- load_model_hdf5(here("datasets/sim_autoencoders_mode.h5"))
preds <- 
    autoencoder %>% 
    predict(df %>% as.matrix) %>%
    `colnames<-`(c("V1", "V2")) %>% 
    as.data.frame()
# Coerce back the matrix to data frame to use ggplot later
df <- as.data.frame(df)
# Euclidean distance larger than 3 = sum of squares larger than 9
df$color <- ifelse((df$V1-preds$V1)**2+(df$V2-preds$V2)**2>9,"red","blue")

If the distance between a point's reconstructed coordinates (preds) and its original coordinates exceeds 3, we treat the reconstruction error as too large and flag the point as an anomaly.
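To see how the threshold of 3 splits the data, the per-point reconstruction error can be summarised directly (a sketch that only reuses objects defined above):

# Euclidean reconstruction error per point, and the number of flagged points
recon_error <- sqrt((df$V1 - preds$V1)^2 + (df$V2 - preds$V2)^2)
summary(recon_error)
sum(recon_error > 3)   # points flagged as anomalies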

library(ggplot2)
df %>% 
    ggplot(aes(V1, V2)) + 
    geom_point(
        color = df$color,
        position = "jitter",
        alpha = 0.25
    )

Clearly, there are still some misclassifications.

We see that the autoencoder correctly learned that most of the points come from a normal distribution centered at (0, 0), as expected. However, some points that are normal in the original dataset were still flagged as unusual. There is no need to get discouraged about autoencoders this soon: the reason is that the autoencoder we used is rather simple. We will look at more sophisticated ways to tackle the outlier detection problem with autoencoders.

The autoencoder learned the joint normal distribution underlying the data and did a good job of identifying the red points that do not belong to that distribution, marking them as outliers. At the same time, some originally normal observations, the red points scattered around the blue cluster, were also flagged; this is a consequence of the autoencoder being too simple.
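To quantify this, the flagged points can be cross-tabulated against the indices that were actually replaced with simulated fraud (a sketch assuming idxs and n_points from the simulation step are still in scope):

# Cross-tabulate flagged points against the injected outlier indices
flagged  <- df$color == "red"
injected <- seq_len(n_points) %in% idxs
table(flagged = flagged, injected = injected)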

Since the autoencoder can at least recover the underlying distribution, it can serve as an unsupervised anomaly detector. It runs more efficiently than clustering-based approaches and is easier to deploy, because the space a clustering algorithm builds cannot be reproduced on new data, whereas a trained autoencoder simply scores new observations as they arrive.
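As an illustration of that deployment story, scoring unseen points only needs the trained model and the same reconstruction-error rule (a sketch; the handful of new points below are simulated purely for demonstration):

# Score previously unseen observations with the trained autoencoder
new_points <- mvrnorm(n = 5, mu = c(4, 4), Sigma = Sigma)
new_preds  <- predict(autoencoder, new_points)
sqrt(rowSums((new_points - new_preds)^2)) > 3   # TRUE = flagged as anomalous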

Liu, Yuxi (Hayden), and Pablo Maldonado. 2018. R Deep Learning Projects: Master the Techniques to Design and Develop Neural Network Models in R. Packt Publishing.