Reference: Chollet and Allaire (2018, 187–90)

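IMDB movie-review classification can be handled with an RNN, a model that has memory. The assumption is that, within a review, what the network has read up to word \(t\) affects how it processes word \(t+1\).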
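To make the "memory" idea concrete, here is a minimal sketch of a simple-RNN forward pass in base R. It is illustrative only and not the keras implementation; the sizes and the random weights `W`, `U`, `b` are assumptions chosen for the toy example.

```r
# Toy simple-RNN forward pass (illustrative; weights are random, not trained).
set.seed(42)
timesteps <- 5   # number of tokens in one toy review
input_dim <- 4   # length of each token's vector
units     <- 3   # length of the hidden state

inputs <- matrix(rnorm(timesteps * input_dim), nrow = timesteps)
W <- matrix(rnorm(units * input_dim), nrow = units)  # input -> state weights
U <- matrix(rnorm(units * units), nrow = units)      # state -> state weights
b <- rnorm(units)                                    # bias

state  <- rep(0, units)                              # the memory starts empty
states <- matrix(0, nrow = timesteps, ncol = units)
for (t in seq_len(timesteps)) {
  # The new state depends on the current token AND on the previous state.
  state <- tanh(as.numeric(W %*% inputs[t, ] + U %*% state + b))
  states[t, ] <- state
}
final_state <- states[timesteps, ]  # the last state summarizes the whole sequence
```

Only the last state is passed on here, which matches what a recurrent layer does when it returns a single output per sequence rather than the full sequence of states.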

First I need to understand the imdb dataset.

IMDB dataset: a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.


Each sample has two parts:

  1. a label y, either 1 (positive) or 0 (negative);
  2. one review, already tokenized.

The reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.
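A small sketch of what "sequences of integers" means, using a made-up five-word dictionary. The `word_index` vector and the `encode()`/`decode()` helpers are hypothetical and only for illustration; the real dataset ships with its own frequency-ranked index.

```r
# Hypothetical toy dictionary: each word maps to one integer.
word_index <- c(the = 1L, movie = 2L, was = 3L, great = 4L, boring = 5L)

# Turn a sentence into integer codes (unknown words would become NA).
encode <- function(text, index) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  unname(index[words])
}

# Turn integer codes back into words.
decode <- function(codes, index) {
  names(index)[match(codes, index)]
}

codes <- encode("The movie was great", word_index)
codes
## [1] 1 2 3 4
decode(codes, word_index)
## [1] "the"   "movie" "was"   "great"
```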


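library(keras)
library(readr)  # read_rds(), write_rds()
library(here)   # here()

max_features <- 10000  # keep only the 10,000 most frequent words
maxlen <- 500          # cap each review at 500 tokens
batch_size <- 32
# First run only: download the data and cache it locally.
# imdb <- dataset_imdb(num_words = max_features)
# imdb %>% 
#     write_rds("datasets/dataset_imdb.rds")
imdb <- read_rds(here("datasets/dataset_imdb.rds"))
c(c(input_train, y_train), c(input_test, y_test)) %<-% imdb
# Pad or truncate every review to exactly maxlen integers.
input_train <- pad_sequences(input_train, maxlen = maxlen)
input_test <- pad_sequences(input_test, maxlen = maxlen)
input_train[[1]] %>% head
## [1] 0
# First six positions of the first review: left padding with zeros.
input_train[1, 1:6]
## [1] 0 0 0 0 0 0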

maxlen: Cuts off texts after this many words (among the max_features most common words)

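Each column (variable) holds one word index. According to the pad_sequences {keras} documentation, each review is cut to at most 500 tokens. With the default settings (padding and truncating both "pre"), longer reviews are trimmed from the front and shorter ones are filled with 0 on the left, which is why the first row above starts with zeros.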
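The padding and truncation behaviour can be imitated in base R. This is only a sketch, assuming the keras defaults `padding = "pre"` and `truncating = "pre"`; `pad_pre()` is a hypothetical helper, not part of keras.

```r
# Base-R imitation of pad_sequences() with pre-padding and pre-truncation.
pad_pre <- function(seqs, maxlen, value = 0L) {
  t(vapply(seqs, function(s) {
    s <- tail(s, maxlen)                  # too long: drop tokens from the front
    c(rep(value, maxlen - length(s)), s)  # too short: fill with `value` on the left
  }, integer(maxlen)))
}

toy <- list(c(5L, 8L), c(1L, 2L, 3L, 4L, 5L, 6L))
padded <- pad_pre(toy, maxlen = 4)
padded
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    5    8
## [2,]    3    4    5    6
```

The short sequence gains two leading zeros, while the long one keeps its last four tokens.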


y_train %>% head
## [1] 1 0 0 1 0 0
# Embedding -> simple RNN -> single sigmoid unit for binary classification.
model <- 
    keras_model_sequential() %>%
    layer_embedding(input_dim = max_features, output_dim = 32) %>%
    layer_simple_rnn(units = 32) %>%
    layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
    optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("acc")
)
# Hold out 20% of the training data for validation.
history <- model %>% fit(
    input_train, y_train,
    epochs = 10,
    batch_size = 128,
    validation_split = 0.2
)
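  1. The embedding layer maps each word to a 32-dimensional vector, and the RNN state is also fixed at 32 units.
  2. Training took quite a while; next time I should record the run time.
  3. Accuracy is not great, only 84%. Likely reasons:
    1. the sequences are long (up to 500 tokens each), which a simple RNN handles poorly;
    2. the input is limited: the vocabulary is capped at the 10,000 most frequent words, and each review is cut to 500 tokens.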
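To make note 1 concrete, the number of trainable parameters in each layer can be worked out by hand. This is a back-of-the-envelope sketch that assumes the layer sizes used above and the usual weight layout of a simple RNN (input weights, recurrent weights, bias).

```r
max_features <- 10000  # vocabulary size (input_dim of the embedding)
embed_dim    <- 32     # output_dim of the embedding
units        <- 32     # size of the RNN state

embedding_params <- max_features * embed_dim                   # one 32-d vector per word
rnn_params       <- units * embed_dim + units * units + units  # input + recurrent weights + bias
dense_params     <- units * 1 + 1                              # weights + bias of the sigmoid unit

params <- c(embedding  = embedding_params,   # 320000
            simple_rnn = rnn_params,         # 2080
            dense      = dense_params)       # 33
total_params <- sum(params)                  # 322113
```

Almost all of the parameters sit in the embedding; the recurrent layer itself is small.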

Part of the problem is that your inputs only consider the first 500 words, rather than full sequences—hence the RNN has access to less information than the earlier baseline model.


