Sentiment Analysis Study Notes
2019-10-22
- Use the RMarkdown child parameter to stitch documents together.
- Notes stitched together this way are easier to review.
- Submit related questions as Issues.
Sentiment analysis is a form of supervised learning. For the simplest binary case (positive vs. negative), logistic regression is enough. Because the features form a sparse matrix, tree models are generally not used and logistic regression is the usual choice, so most of the engineering effort goes into the feature dimension, i.e. tokenization and related steps.
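As a minimal sketch of that pipeline (my own, not from the course; it assumes a DataFrame reviews with a text column review and a 0/1 label column score, as used later in these notes):
# Sparse bag-of-words features fed into logistic regression for binary sentiment
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
vect = CountVectorizer(max_features=1000)
X = vect.fit_transform(reviews.review)  # sparse matrix; no need to densify for logistic regression
y = reviews.score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))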
1 Lexicon-based approach
1.1 Main components
Sentiment analysis has three main components: emotion, subject, and opinion holder.
length_reviews = movies.review.str.len()
# How long is the longest review
print(max(length_reviews))
print(min(length_reviews))
Basically this could also be done with a single apply.
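For example (a sketch; same result as str.len() above):
# Character length of each review via apply
length_reviews = movies.review.apply(len)
print(length_reviews.max(), length_reviews.min())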
1.2 Granularity
granularity (pronunciation mnemonic: gran-new-larry-ti)
- document level => each review
- sentence level => each sentence: positive or negative?
- aspect level => the feature of the product
The camera of the phone is pretty good but the battery life is disappointing. (Misheva 2019)
Here we consider the different aspects of the product, which feels a bit like topic modeling.
1.3 Lexicon-based approach
Lexicon-based analysis amounts to applying hard-coded rules.
In the end the best approach is a hybrid one that uses both the lexicon and ML.
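A toy illustration of those hard rules (my own; the word lists are made up, not a real lexicon):
# Score a sentence by counting matches against tiny hand-made word lists
import re
positive_words = {'good', 'great', 'excellent', 'love'}
negative_words = {'bad', 'poor', 'disappointing', 'hate'}
def lexicon_score(text):
    tokens = re.findall(r'\w+', text.lower())
    return sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
print(lexicon_score('The camera of the phone is pretty good but the battery life is disappointing.'))  # 0, the two sides cancel out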
The sentiment property returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective. (Misheva 2019) See also https://textblob.readthedocs.io/en/dev/quickstart.html#tokenization
polarity
- -1 negative
- 0 neutral
- +1 positive
# Import the required packages
from textblob import TextBlob
# Create a textblob object
blob_annak = TextBlob(annak)
blob_catcher = TextBlob(catcher)
# Print out the sentiment
print('Sentiment of annak: ', blob_annak.sentiment)
print('Sentiment of catcher: ', blob_catcher.sentiment)
<script.py> output:
Sentiment of annak: Sentiment(polarity=0.05000000000000002, subjectivity=0.95)
Sentiment of catcher: Sentiment(polarity=-0.05, subjectivity=0.5466666666666666)
the opening sentence of Catcher in the Rye has a negative score, whereas the one from Anna Karenina has a slightly positive one. (Misheva 2019)
1.4 wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Generate the word cloud from the east_of_eden string
cloud_east_of_eden = WordCloud(background_color="white").generate(east_of_eden)
# Create a figure of the generated cloud
plt.imshow(cloud_east_of_eden, interpolation='bilinear')
plt.axis('off')
# Display the figure
plt.show()
Recall that the interpolation argument makes the word cloud appear more smoothly. (Misheva 2019)
Word clouds are generally used for visualization and EDA.
2 Feature engineering
Calling jieba and tokenizing can be done in Python, which is much faster than doing it in R.
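A minimal jieba sketch (my own; the sample sentence and the shown output are illustrative):
# Tokenize a Chinese sentence; join with spaces so CountVectorizer can split it later
import jieba
tokens = jieba.lcut('这个电影的配乐非常好听')
print(tokens)            # e.g. ['这个', '电影', '的', '配乐', '非常', '好听']
print(' '.join(tokens))  # space-joined text works directly with CountVectorizer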
2.1 Language detection
Just pick the first (highest-probability) detection and then look at the counts.
# Import the language detection function and package
from langdetect import detect_langs
# Detect the language of the foreign string
print(detect_langs(foreign))
from langdetect import detect_langs
languages = []
# Loop over the sentences in the list and detect their language
for sentence in range(len(sentences)):
    languages.append(detect_langs(sentences[sentence]))
print('The detected languages are: ', languages)
['La histoire rendu étai fidèle, excellent, et grand.', 'Excelente muy recomendable.', 'It had a leak from day one but the return and exchange process was very quick.']
<script.py> output:
The detected languages are: [[fr:0.9999960582179737], [es:0.9999960229555345], [en:0.9999979325190786]]
from langdetect import detect_langs
languages = []
# Loop over the rows of the dataset and append
for row in range(len(non_english_reviews)):
    languages.append(detect_langs(non_english_reviews.iloc[row, 1]))
# Clean the list by splitting
languages = [str(lang).split(':')[0][1:] for lang in languages]
# Assign the list to a new feature
non_english_reviews['language'] = languages
print(non_english_reviews.head())
<script.py> output:
score review language
1249 1 Il grande ritorno!: E' dai tempi del tour di ... it
1259 1 La reencarnación vista por un científico: El ... es
1260 1 Excelente Libro / Amazing book!!: Este libro ... es
1261 1 Magnifico libro: Brian Weiss ha dejado una ma... es
1639 1 El libro mas completo que existe para nosotra... es
2.2 BoW
BoW is short for Bag of Words; it learns from the vocabulary without directly taking word order into account.
Remember that the output of a CountVectorizer() is a sparse matrix, which stores only entries which are non-zero. (Misheva 2019)
sklearn.feature_extraction.text.CountVectorizer
The returned object is a sparse matrix.
2.2.1 Convert to np.array
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer
annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']
# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)
# Create the bow representation
anna_bow = anna_vect.transform(annak)
# Print the bag-of-words result
print(anna_bow.toarray())
2.2.2 Convert to pd.DataFrame
from sklearn.feature_extraction.text import CountVectorizer
# Build the vectorizer, specify max features
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)
# Transform the review column
X_review = vect.transform(reviews.review)
# Create the bow representation
X_df=pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())
2.3 ngrams
So far everything used unigrams; now we switch to bigrams, trigrams, and n-grams in general.
Terms that are too frequent or too rare are both unhelpful and get removed, which has a bit of a tf-idf flavor.
Implementing this has nothing to do with jieba; it is handled by CountVectorizer, so check the relevant arguments.
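Since tf-idf came up, a minimal TfidfVectorizer sketch (my own addition, not a course exercise; same reviews.review column as above):
# tf-idf weighting instead of raw counts; min_df/max_df drop too-rare and too-frequent terms
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df=5, max_df=0.8, ngram_range=(1, 2))
X_tfidf = vect.fit_transform(reviews.review)
print(X_tfidf.shape)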
In [2]: reviews.review.head()
Out[2]:
0 Stuning even for the non-gamer: This sound tr...
1 The best soundtrack ever to anything.: I'm re...
2 Amazing!: This soundtrack is my favorite musi...
3 Excellent Soundtrack: I truly like this sound...
4 Remember, Pull Your Jaw Off The Floor After H...
Name: review, dtype: object
The vectorizer itself handles the tokenization.
from sklearn.feature_extraction.text import CountVectorizer
# Build the vectorizer using uni- and bigrams, then fit it
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(reviews.review)
# Transform the review column
X_review = vect.transform(reviews.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())
from sklearn.feature_extraction.text import CountVectorizer
# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(max_features=100)
vect.fit(movies.review)
# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())
This limit is there so the model can later be translated into SQL; too many variables would not work.
limit the size of the vocabulary to include terms which occur in no more than 200 documents. (Misheva 2019)
2.3.1 document frequency
from sklearn.feature_extraction.text import CountVectorizer
# Build and fit the vectorizer
vect = CountVectorizer(max_df=200)
vect.fit(movies.review)
# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())
2.3.2 binary
binary : boolean, default=False
If True, all non zero counts are set to 1. This is useful for discrete
probabilistic models that model binary events rather than integer
counts.
This parameter makes the normalization easier to translate.
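The exercise below actually varies min_df; for completeness, a sketch that sets binary=True itself (my own addition):
# Presence/absence features: every non-zero count becomes 1
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(binary=True, max_features=100)
X_binary = vect.fit_transform(movies.review)
print(X_binary.toarray()[:5])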
from sklearn.feature_extraction.text import CountVectorizer
# Build and fit the vectorizer
vect = CountVectorizer(min_df=50)
vect.fit(movies.review)
# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())
It looks like there is no option to limit by raw word frequency.
2.3.3 Combined
# Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)
# Transform the review
X_review = vect.transform(reviews.review)
# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())
2.4 Feature engineering
2.4.1 Sentence length
Tokenization is done with a list comprehension.
The result is a list of lists.
n_tokens is itself a feature variable!
2.4.2 Punctuation
n_punctuation can help indicate sentiment.
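A minimal sketch for building such a feature (my own; assumes the reviews DataFrame used elsewhere in these notes):
# Count punctuation characters in each review as a new feature
from string import punctuation
reviews['n_punctuation'] = reviews.review.apply(lambda text: sum(ch in punctuation for ch in text))
print(reviews[['review', 'n_punctuation']].head())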
# Import the required function
from nltk import word_tokenize
# Transform the GoT string to word tokens
print(word_tokenize(GoT))
A substitute for jieba.
In [3]: GoT
Out[3]: 'Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armour yourself in it, and it will never be used to hurt you.'
# Import the word tokenizing function
from nltk import word_tokenize
# Tokenize each item in the avengers
tokens_avengers = [word_tokenize(item) for item in avengers]
print(tokens_avengers)
In [9]: avengers
Out[9]:
["Cause if we can't protect the Earth, you can be d*** sure we'll avenge it",
'There was an idea to bring together a group of remarkable people, to see if we could become something more',
"These guys come from legend, Captain. They're basically Gods."]
# Create an empty list to store the length of reviews
len_tokens = []
# Iterate over the word_tokens list and determine the length of each item
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))
# Create a new feature for the length of each review
reviews['n_words'] = len_tokens
This can be improved into a list comprehension, for example:
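The equivalent list-comprehension version (same word_tokens list as above):
# Number of tokens per review, written as a list comprehension
reviews['n_words'] = [len(tokens) for tokens in word_tokens]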
3 RNNs
See 王树义 (2019).
3.1 Tuning notes
I currently use dropout with a probability 50%, but this could perhaps be increased. The optimizer is Adam with the default parameters for Keras (http://keras.io/optimizers/#adam). (Stack Exchange Users 2019)
Dropout is generally used to reduce overfitting.
Reduce the number of hidden units, I know you said it already seems low, but given that the input layer only has 80 features, it actually can be that 128 is too much. (Stack Exchange Users 2019)
Reducing the number of LSTM units also helps with overfitting.
LSTM units: LSTM_units = 32, while my word vectors already have 300 dimensions.
It did help: after lowering to LSTM_units = 32, the model does a bit better.
A rule of thumb is to have the number of hidden units be in-between the number of input units (80) and output classes (5) (Stack Exchange Users 2019)
By this rule of thumb, LSTM_units \(\in\) (number of classes of y, word-vector dimension); for me that range is 2 to 300, so the current value is appropriate.
Alternatively, you could increase the dimension of the input representation space to be more than 80 (however this may overfit as well if the representation is already too narrow for any given word). (Stack Exchange Users 2019)
But as for increasing the word-vector dimension: in my current setup the embedding is given, so it cannot be changed.
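Putting these notes together, a minimal Keras sketch (my own, not from the cited answers; vocab_size, embedding_matrix and maxlen are placeholders for my project setup, and the task is assumed to be binary):
# 300-d pretrained embeddings (frozen), 32 LSTM units, dropout 0.5, Adam optimizer
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=300,
                    weights=[embedding_matrix], trainable=False,  # embeddings are given, not trained
                    input_length=maxlen))
model.add(LSTM(32))      # LSTM_units = 32
model.add(Dropout(0.5))  # dropout to reduce overfitting
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()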
4 CNNs
See Brownlee (2016).
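A minimal Conv1D-for-text sketch in the same spirit (my own, not copied from Brownlee 2016; same placeholder inputs as above):
# 1-D convolution over the embedded sequence, then global max pooling and a sigmoid head
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=300, input_length=maxlen))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])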
5 Bidirectional-LSTM
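A minimal sketch of the Bidirectional wrapper around the same LSTM layer (my own; same placeholder inputs as above):
# The LSTM reads the sequence in both directions; everything else stays the same
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=300, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])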
Appendix
5.1 Chinese word segmentation
References
Brownlee, Jason. 2016. “Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras.” Machine Learning Mastery. 2016. https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/.
Misheva, Violeta. 2019. “Sentiment Analysis in Python.” DataCamp. 2019. https://www.datacamp.com/courses/sentiment-analysis-in-python.
Stack Exchange Users. 2019. “Preventing Overfitting of LSTM on Small Dataset.” Cross Validated. 2019. https://stats.stackexchange.com/questions/204745/preventing-overfitting-of-lstm-on-small-dataset.
王树义. 2019. “如何用 Python 和循环神经网络(RNN)做中文文本分类?.” 玉树芝兰. 2019. https://mp.weixin.qq.com/s/ZreZPR8BN5EuDRN2t33dlQ.