Reference: Chollet and Allaire (2018, 187–90)

IMDB movie-review classification can be handled with an RNN, a model that has memory. The assumption is that the \(t\)-th word in a review carries information that helps interpret the \((t+1)\)-th word, i.e. word order within each review matters.
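For reference, the simple RNN used below keeps a hidden state that is updated at every step; with the default tanh activation of layer_simple_rnn the update is roughly

\[ s_t = \tanh(W x_t + U s_{t-1} + b) \]

where \(x_t\) is the embedded word at step \(t\), \(s_{t-1}\) is the state carried over from the previous word, and \(W\), \(U\), \(b\) are learned weights. The dependence on \(s_{t-1}\) is what gives the model its memory.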

First, I need to understand the imdb dataset.

IMDB dataset: a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

Dataset description: these are highly polarized reviews. Each sample therefore consists of

  1. a label y, either 1 (positive) or 0 (negative)
  2. one review, already tokenized

The reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

The words have already been mapped through a dictionary to integer indices, which keeps the memory footprint small.

library(keras)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.1.0     √ purrr   0.2.5
## √ tibble  2.0.1     √ dplyr   0.7.8
## √ tidyr   0.8.2     √ stringr 1.3.1
## √ readr   1.3.1     √ forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
## here() starts at D:/work/learn_nn
max_features <- 10000
maxlen <- 500
batch_size <- 32
cat("Loading data...\n")
## Loading data...
# imdb <- dataset_imdb(num_words = max_features)
# library(readr)
# imdb %>% 
#     write_rds("datasets/dataset_imdb.rds")
imdb <- read_rds(here("datasets/dataset_imdb.rds"))
c(c(input_train, y_train), c(input_test, y_test)) %<-% imdb
cat(length(input_train), "train sequences\n")
## 25000 train sequences
cat(length(input_test), "test sequences")
## 25000 test sequences
cat("Pad sequences (samples x time)\n")
## Pad sequences (samples x time)
input_train <- pad_sequences(input_train, maxlen = maxlen)
input_test <- pad_sequences(input_test, maxlen = maxlen)
cat("input_train shape:", dim(input_train), "\n")
## input_train shape: 25000 500
cat("input_test shape:", dim(input_test), "\n")
## input_test shape: 25000 500
input_train[[1]] %>% head
## [1] 0
pad_sequences(input_train, maxlen = maxlen) %>% .[1,1:6]
## [1] 0 0 0 0 0 0

maxlen: Cuts off texts after this many words (among the max_features most common words)

Each column of the padded matrix corresponds to one word position. According to the pad_sequences {keras} documentation, only 500 words are kept per review (maxlen = 500).

Checking the padded matrix, however, shows that the sequences themselves have been changed: reviews shorter than 500 words are padded out with leading zeros.
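To see this in isolation, here is a small sketch on made-up toy sequences (not IMDB data): by default pad_sequences() left-pads short sequences with zeros and, for sequences longer than maxlen, keeps only the last maxlen values.

toy <- list(c(5, 9), c(1, 2, 3, 4, 5, 6, 7, 8))
# The short sequence becomes 0 0 5 9; the long one keeps its last four values.
pad_sequences(toy, maxlen = 4)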

input_train[1:6,1:6]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
## [4,]  687   23    4    2 7339    6
## [5,]    0    0    0    0    0    0
## [6,]    0    0    0    0    0    0
y_train %>% head
## [1] 1 0 0 1 0 0
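As a sanity check on the integer encoding, the indices can be mapped back to words using the word index shipped with the dataset. A minimal sketch, assuming the raw (unpadded) sequences in imdb are still in memory and using the usual offset of 3 for the reserved padding / start / unknown indices:

word_index <- dataset_imdb_word_index()       # named list: word -> integer rank
reverse_word_index <- names(word_index)
names(reverse_word_index) <- unlist(word_index)
decoded <- sapply(imdb$train$x[[1]], function(i) {
  # ranks are stored shifted by 3; indices 0-2 are padding / start / unknown
  word <- if (i >= 3) reverse_word_index[as.character(i - 3)] else NA
  if (!is.na(word)) word else "?"
})
cat(paste(decoded, collapse = " "))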
model <- 
    keras_model_sequential() %>%
    layer_embedding(input_dim = max_features, output_dim = 32) %>%
    layer_simple_rnn(units = 32) %>%
    layer_dense(units = 1, activation = "sigmoid")
model %>% 
    compile(
        optimizer = "rmsprop",
        loss = "binary_crossentropy",
        metrics = c("acc")
)
history <- model %>% fit(
    input_train, y_train,
    epochs = 10,
    batch_size = 128,
    validation_split = 0.2
)
  1. The embedding output is reduced to 32 dimensions, and the RNN state is likewise 32 units.
  2. Training took quite a while; next time I should record the time (see the sketch after this list).
  3. The accuracy is not great, only about 84%, because
    1. each review is truncated to maxlen = 500 words, so the model never sees the full text,
    2. the vocabulary is capped at the max_features = 10,000 most frequent words; everything else is dropped.
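On the timing point, here is a minimal sketch of how a later run could be clocked with system.time(); this is hypothetical and was not run for the results above.

timing <- system.time({
  history <- model %>% fit(
    input_train, y_train,
    epochs = 10,
    batch_size = 128,
    validation_split = 0.2
  )
})
timing["elapsed"]  # wall-clock seconds for the whole fit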
model %>% 
    save_model_hdf5(here("datasets/imdb_rnn_model.h5"))
history %>% 
    write_rds(here("datasets/imdb_rnn_history.rds"))
library(keras)
library(here)
library(tidyverse)
model <-  
    load_model_hdf5(here("datasets/imdb_rnn_model.h5"))
history <- 
    read_rds(here("datasets/imdb_rnn_history.rds"))
plot(history)

Part of the problem is that your inputs only consider the first 500 words, rather than full sequences—hence the RNN has access to less information than the earlier baseline model.

In other words, the model sees too little of each review, which is why the performance is underwhelming.
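If memory and training time allowed, a natural follow-up would be to pad with a larger maxlen so the RNN sees more of each review. A hedged sketch (maxlen_long is an arbitrary choice, and it reuses the raw sequences in imdb rather than the already-padded matrices):

maxlen_long <- 1000
input_train_long <- pad_sequences(imdb$train$x, maxlen = maxlen_long)
input_test_long <- pad_sequences(imdb$test$x, maxlen = maxlen_long)
# The model definition is unchanged; only the inputs carry more words per review.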

Chollet, François, and J.J. Allaire. 2018. Deep Learning with R. Manning Publications.