See Chollet and Allaire (2018, 190–95).

library(knitr)
library(here)
## here() starts at D:/work/learn_nn
include_graphics(here("figure/ltsm-info-pipe.png"))

Compared with a simple RNN, the LSTM adds an extra information flow (the carry track), which shows up in the matrix operations of each for-loop step as an additional term (\(V_o c_t\)) in the output computation. Having finished the simple RNN, the results turned out to be unsatisfactory, so the next step is to find out why.

include_graphics(here("figure/ltsm-c.png"))

One major issue with layer_simple_rnn is that although it should theoretically be able to retain at time t information about inputs seen many timesteps before, in practice, such long-term dependencies are impossible to learn. This is due to the vanishing gradient problem, an effect that is similar to what is observed with non-recurrent networks (feedforward networks) that are many layers deep: as you keep adding layers to a network, the network eventually becomes untrainable.

One problem with RNNs is that the temporal dependencies stretch over many timesteps, so vanishing gradients are common; LSTM and GRU are used instead.
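To see why long chains of timesteps cause trouble, consider a toy calculation (illustrative only, not part of the book's example): backpropagation through time multiplies one local derivative per step, and once those factors sit below 1 the product collapses toward zero.

local_grads <- rep(0.9, 100) # hypothetical per-timestep factors, each below 1
prod(local_grads)            # about 2.7e-05: after 100 steps the gradient has effectively vanished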

This is essentially what LSTM does: it saves information for later, thus preventing older signals from gradually vanishing during processing.

Just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time, thus fighting the vanishing-gradient problem.

Refresh old memory. LSTM was introduced precisely to solve this vanishing-gradient problem.
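The carry itself is updated with three gates, again in the book's pseudocode notation (a sketch, not runnable R):

i_t <- activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)  # input gate
f_t <- activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)  # forget gate
k_t <- activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)  # candidate values
c_t_next <- i_t * k_t + c_t * f_t                            # next carry: keep some old memory, mix in some new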

Let's walk through an example.

library(keras)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.1.0     √ purrr   0.2.5
## √ tibble  2.0.1     √ dplyr   0.7.8
## √ tidyr   0.8.2     √ stringr 1.3.1
## √ readr   1.3.1     √ forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
max_features <- 10000 # number of words to consider as features
maxlen <- 500         # cut off each review after this many words
batch_size <- 32
cat("Loading data...\n")
## Loading data...
# imdb <- dataset_imdb(num_words = max_features)
# library(readr)
# imdb %>% 
#     write_rds("datasets/dataset_imdb.rds")
imdb <- read_rds(here("datasets/dataset_imdb.rds"))
c(c(input_train, y_train), c(input_test, y_test)) %<-% imdb # multi-assignment via the zeallot %<-% operator re-exported by keras
cat(length(input_train), "train sequences\n")
## 25000 train sequences
cat(length(input_test), "test sequences")
## 25000 test sequences
cat("Pad sequences (samples x time)\n")
## Pad sequences (samples x time)
input_train <- pad_sequences(input_train, maxlen = maxlen)
input_test <- pad_sequences(input_test, maxlen = maxlen)
cat("input_train shape:", dim(input_train), "\n")
## input_train shape: 25000 500
cat("input_test shape:", dim(input_test), "\n")
## input_test shape: 25000 500
model <- 
    keras_model_sequential() %>%
    layer_embedding(input_dim = max_features, output_dim = 32) %>%
    layer_lstm(units = 32) %>%
    layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
    optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("acc")
)
history <- 
    model %>% fit(
    input_train, y_train,
    epochs = 10,
    batch_size = 128,
    validation_split = 0.2
)
model %>% 
    save_model_hdf5(here("datasets/imdb_ltsm_model.h5"))
history %>% 
    write_rds(here("datasets/imdb_ltsm_history.rds"))
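As noted earlier, layer_gru is a lighter-weight alternative to layer_lstm that also keeps gradients alive over long sequences. A sketch of the same architecture with a GRU layer (not trained or saved here; model_gru is just an illustrative name):

model_gru <- 
    keras_model_sequential() %>%
    layer_embedding(input_dim = max_features, output_dim = 32) %>%
    layer_gru(units = 32) %>%
    layer_dense(units = 1, activation = "sigmoid")
model_gru %>% compile(
    optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("acc")
)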
library(keras)
library(here)
library(tidyverse)
model <-  
    load_model_hdf5(here("datasets/imdb_ltsm_model.h5"))
history <- 
    read_rds(here("datasets/imdb_ltsm_history.rds"))
plot(history)
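The test split prepared earlier is never scored above. If the preceding chunks have run in the same session, the reloaded model can be evaluated on it; a minimal sketch (results not shown):

model %>% evaluate(input_test, y_test)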

This time, you achieve up to 88% validation accuracy. Not bad: certainly much better than the simple RNN network—that’s largely because LSTM suffers much less from the vanishing-gradient problem.

Because the LSTM mitigates the vanishing-gradient problem, it performs somewhat better.

But honestly, the primary reason is that analyzing the global, long-term structure of the reviews (what LSTM is good at) isn’t helpful for a sentiment-analysis problem. Such a basic problem is well solved by looking at what words occur in each review, and at what frequency.

As mentioned earlier, this is not really a sequence-order problem, so LSTM and RNN cannot solve it especially well; looking at which words occur in each review and how often, along with other such features, is more effective. See nlp_minimal_example.
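For reference, a minimal sketch of such a word-presence baseline, in the spirit of the dense IMDB example in Chollet and Allaire (2018): multi-hot encode each review (which words occur, ignoring order) and fit a small dense classifier. The helper vectorize_sequences() and the *_raw / *_bow names are illustrative, not part of the original script.

# rebuild the raw (unpadded) training split from the cached dataset
c(c(train_x_raw, train_y_raw), c(test_x_raw, test_y_raw)) %<-% imdb
vectorize_sequences <- function(sequences, dimension = max_features) {
    # one row per review; 1 marks every word index that occurs in it
    results <- matrix(0, nrow = length(sequences), ncol = dimension)
    for (i in seq_along(sequences))
        results[i, sequences[[i]]] <- 1
    results
}
x_train_bow <- vectorize_sequences(train_x_raw)
y_train_bow <- as.numeric(train_y_raw)
model_bow <- 
    keras_model_sequential() %>%
    layer_dense(units = 16, activation = "relu", input_shape = c(max_features)) %>%
    layer_dense(units = 1, activation = "sigmoid")
model_bow %>% compile(
    optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("acc")
)
# model_bow %>% fit(x_train_bow, y_train_bow, epochs = 4, batch_size = 512, validation_split = 0.2)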

Chollet, François, and J.J. Allaire. 2018. Deep Learning with R. Manning Publications.