  1. This document is stitched together using the RMarkdown child parameter.
  2. Notes stitched this way are easier to review.
  3. Please file related questions as an Issue.

Sentiment analysis is a form of supervised learning. For the simplest binary case (positive vs. negative), logistic regression is generally sufficient. Moreover, because the features form a sparse matrix, tree models are usually avoided and logistic regression is used instead. So the bulk of the engineering effort goes into the feature dimension, i.e. word segmentation and related preprocessing.
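
A minimal sketch of that pipeline, assuming scikit-learn; the toy corpus below is made up purely for illustration:

```python
# Minimal sketch: sparse bag-of-words features + logistic regression
# for binary sentiment. The toy data is made up for illustration.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "text": ["great phone, love it", "terrible battery, do not buy",
             "works well, very happy", "awful screen, regret it"],
    "label": [1, 0, 1, 0],
})

X = CountVectorizer().fit_transform(df["text"])  # sparse document-term matrix

# Logistic regression consumes the sparse matrix directly; tree models
# generally handle this kind of high-dimensional sparsity poorly.
clf = LogisticRegression(max_iter=1000).fit(X, df["label"])
print(clf.predict(X))
```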

1 Lexicon Method

1.1 Components

Sentiment analysis has three main components: emotion, subject, and opinion holder.

Basically this can be done with a single apply call, as in the sketch below.
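
A sketch of that one-liner, assuming a pandas DataFrame with a text column and TextBlob (which section 1.3 introduces):

```python
# Sketch: score every review with one `apply` call.
# The DataFrame and its contents are illustrative; TextBlob is
# introduced in section 1.3 below.
import pandas as pd
from textblob import TextBlob

reviews = pd.DataFrame({"text": [
    "The camera of the phone is pretty good.",
    "The battery life is disappointing.",
]})
reviews["polarity"] = reviews["text"].apply(lambda t: TextBlob(t).sentiment.polarity)
print(reviews)
```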

1.2 Granularity

granularity (pronunciation mnemonic: gran-new-larry-ti)

  1. document level => each review
  2. sentence level => each sentence: positive or negative?
  3. aspect level => each feature of the product

The camera of the phone is pretty good but the battery life is disappointing. (Misheva 2019)

Aspect level looks at the different aspects of one product, which feels a bit like topic modeling.

1.3 Lexicon-based analysis

Lexicon-based analysis simply applies hard-coded rules.

In the end the best approach is to use both the lexicon and ML: a hybrid works best!

The sentiment property returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective. (Misheva 2019; https://textblob.readthedocs.io/en/dev/quickstart.html#tokenization)

Polarity scale:

  1. -1 negative
  2. 0 neutral
  3. +1 positive

the opening sentence of Catcher in the Rye has a negative score, whereas the one from Anna Karenina has a slightly positive one. (Misheva 2019)
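
A quick check of the named tuple; the sentences below are illustrative stand-ins, not the actual novel openings:

```python
from textblob import TextBlob

# Expect polarity < 0 for the first sentence and > 0 for the second;
# both sentences are stand-ins, not quotes from the novels.
print(TextBlob("Everything about today felt terrible and lousy.").sentiment)
print(TextBlob("Happy families are all alike in wonderful ways.").sentiment)
```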

2 Feature Engineering

Calling jieba for word segmentation can be done from Python, which is much faster than doing it in R.
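
A minimal jieba sketch; joining the tokens with spaces makes the result directly consumable by CountVectorizer later on:

```python
# Minimal jieba sketch: exact-mode segmentation of a Chinese sentence.
import jieba

text = "这部手机的相机很好,但是电池续航让人失望。"
tokens = jieba.lcut(text)   # returns a plain list of tokens
print(" ".join(tokens))     # space-joined, ready for CountVectorizer
```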

2.1 Detecting the text's language

Pick the first (most probable) language for each document, then just check the counts per language; see the sketch below.
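
A sketch of that step; the exact library used in the course is not recorded in these notes, so langdetect here is an assumption:

```python
# Sketch of per-document language detection, assuming the `langdetect`
# package (the library actually used in the course is not recorded here).
from langdetect import detect_langs

texts = ["This phone is great.", "这部手机很好。"]
for t in texts:
    candidates = detect_langs(t)   # Language(lang, prob) list, sorted by prob
    print(candidates[0].lang)      # take the first, i.e. most probable, language
```

Tallying the detected languages afterwards (e.g. with pandas value_counts) gives the count check mentioned above.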

2.2 n-grams

So far everything has been unigrams; now we use bigrams, trigrams, and n-grams in general.

Terms that are too frequent or too rare are both unhelpful and get filtered out, which feels a bit like the idea behind tf-idf.

Implementing this step has nothing to do with jieba; it is handled by CountVectorizer, so look into that function's arguments (see the sketch below).

The function itself has tokenization built in.

Keeping the vocabulary small also matters because the model has to be translated into SQL; too many variables would make that unmanageable.

limit the size of the vocabulary to include terms which occur in no more than 200 documents. (Misheva 2019)
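
A sketch of the relevant CountVectorizer arguments; only max_df=200 comes from the quote, while the min_df and max_features values are illustrative:

```python
# Sketch: n-grams plus vocabulary pruning, all inside CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    max_df=200,           # drop terms occurring in more than 200 documents
    min_df=5,             # drop terms in fewer than 5 documents (illustrative)
    max_features=1000,    # cap the vocabulary size outright (illustrative)
)
# CountVectorizer tokenizes by itself (see the `analyzer` and
# `token_pattern` arguments), which is why jieba is not involved here.
```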

3 RNNs

See 王树义 (2019).

3.1 Tuning experience

I currently use dropout with a probability 50%, but this could perhaps be increased. The optimizer is Adam with the default parameters for Keras (http://keras.io/optimizers/#adam). (Stack Exchange Users 2019)

Dropout is generally used to reduce overfitting.

Reduce the number of hidden units, I know you said it already seems low, but given that the input layer only has 80 features, it actually can be that 128 is too much. (Stack Exchange Users 2019)

Lower the LSTM's number of units to deal with overfitting: LSTM_units = 32, while my word vectors are already 300-dimensional. It did help; after lowering to LSTM_units = 32, results got a bit better.

A rule of thumb is to have the number of hidden units be in-between the number of input units (80) and output classes (5) (Stack Exchange Users 2019)

By this rule of thumb, \(\text{LSTM\_units} \in (\text{number of output classes}, \text{embedding dimension})\); mine falls in the range 2 to 300, so it is appropriate.

Alternatively, you could increase the dimension of the input representation space to be more than 80 (however this may overfit as well if the representation is already too narrow for any given word). (Stack Exchange Users 2019)

But as for increasing the dimension of the input representation: the word vectors here are pretrained and fixed, so that is not an option.
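
A sketch of the setup these notes converge on; the layer sizes follow the discussion above, while the vocabulary size and sequence length are placeholders:

```python
# Sketch of the tuned setup: fixed 300-d embeddings, a small LSTM,
# 50% dropout, and Keras's default Adam optimizer.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

vocab_size, seq_len = 10000, 80   # placeholders, not values from the notes

model = Sequential([
    Input(shape=(seq_len,)),
    # The pretrained 300-d vectors would be loaded into this layer;
    # they stay frozen, since the embeddings are given and cannot change.
    Embedding(vocab_size, 300, trainable=False),
    LSTM(32),          # LSTM_units lowered from 128 to 32 against overfitting
    Dropout(0.5),      # 50% dropout, as in the quoted answer
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```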

4 CNNs

See Brownlee (2016).
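
A rough sketch in the spirit of Brownlee (2016), who puts a one-dimensional convolution and pooling layer in front of the LSTM; the specific sizes below are assumptions rather than quotes from the post:

```python
# Sketch: Conv1D + pooling in front of an LSTM for sequence classification,
# following the general pattern of Brownlee (2016); sizes are assumptions.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

model = Sequential([
    Input(shape=(500,)),         # padded review length (assumed)
    Embedding(5000, 32),         # vocabulary cap, 32-d embeddings (assumed)
    Conv1D(32, 3, padding="same", activation="relu"),  # local n-gram-like filters
    MaxPooling1D(pool_size=2),   # halve the sequence length
    LSTM(100),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```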

5 Bidirectional-LSTM

Appendix

5.1 Chinese word segmentation

References

Brownlee, Jason. 2016. “Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras.” Machine Learning Mastery. 2016. https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/.

Misheva, Violeta. 2019. “Sentiment Analysis in Python.” DataCamp. 2019. https://www.datacamp.com/courses/sentiment-analysis-in-python.

Stack Exchange Users. 2019. “Preventing Overfitting of LSTM on Small Dataset.” Cross Validated. 2019. https://stats.stackexchange.com/questions/204745/preventing-overfitting-of-lstm-on-small-dataset.

王树义. 2019. “如何用 Python 和循环神经网络(RNN)做中文文本分类?.” 玉树芝兰. 2019. https://mp.weixin.qq.com/s/ZreZPR8BN5EuDRN2t33dlQ.