使用 RMarkdown 的 child 参数，进行文档拼接。
这样拼接以后的笔记方便复习。
相关问题提交到 Issue

1 综述

参考机器学习算法与自然语言处理 (2019)

神经语言模型在训练过程中，不仅学习到预测下一个词的概率分布，同时也得到了一个副产品：词嵌入表示。

语言模型产出词的概率分布和词嵌入。

相比随机初始化的词嵌入，模型训练完成后的词嵌入已经包含了词汇之间的信息。2013年，Mikolov等人提出了word2vec工具，其中包含了CBOW（Continue Bag of Words）模型和Skip-gram模型[2-3]，该工具仅仅利用海量的单语数据，通过无监督的方法训练得到词嵌入。

词嵌入本身具有局限性，最主要的缺点是无法解决一词多义问题，不同的词在不同的上下文中会有不同的意思，而词嵌入对模型中的每个词都分配了一个固定的表示。

词嵌入的问题是无法处理一词多义的问题。

Peters等人提出了ELMo（Embedding from Language Model）[4]，即使用语言模型来获取深层的上下文表示。ELMo的具体做法是，基于每个词所在的上下文，利用双向LSTM的语言模型来获取这个词的表示。ELMo的方法能够提取丰富的特征给下游任务使用，但是ELMo仅仅进行特征提取而没有预训练整个网络，远远没有发挥预训练的潜力，另外一个不足之处是，自注意力机制的Transformer模型结构，相比LSTM能够更有效地捕获长距离依赖，对句子中信息进行更充分的建模。

双向LSTM可以拿上下文，就是 RNNs。

GPT将LSTM换成了Transformer,获得了更高的成绩，但是由于使用的是单向模型，只能通过前面词预测后面的词，可能会遗漏信息。

GPT: generative pre training
bert bidirectional 是双向的。

Devlin等人提出了BERT（Bidirectional Encoder Representations from Transformers）[6]，即基于Transformer的双向编码器表示。 BERT提出后大火，也许是因为BERT的效果太好。目前绝大多数的预训练模型都是在BERT上改造而来。清华大学的王晓智和张正彦同学给出了目前的预训练模型关系图，这里引用一下，如下图所示：

如果没有超越Transformer的特征提取模型提出的话，

现在 transformer 是比词向量的更好特征提取方式。

既然各类不同的任务对预训练模型进行不同的改造，那能不能有一种预训练模型，可以适应全部的NLP任务呢，近期谷歌刚刚发布了超大规模的T5（NLP Text-to-Text）[11]预训练模型。它给整个 NLP 预训练模型领域提供了一个通用框架，把所有任务都转化成一种形式，无论什么任务，直接拿来一个超大预训练模型，然后主要工作就变成了怎么把任务转换成合适的文本输入输出，，比如德英翻译，只需将训练数据集的输入部分前加上“translate German to English”。另外的思路是，不是改造预训练的模型，而是将庞大的预训练模型进行压缩，比如近期的alBERT，通过共享参数、引入单独的词嵌入层维度来减少BERT的参数。最终性能也登上GLUE第一（刚刚被T5超过）。还有通过知识蒸馏技术，训练出的tinyBERT等，都是对BERT模型进行的模型压缩。

目前 T5 是更强的模型。

2 LSTM 直观理解

我们从一个思考实验开始。当你在网络上购买生活用品时，一般会先阅读商品评论来判断商品好坏，以确定是否要购买这个商品。 (Nguyen 2018) 当你查看评论时，你的大脑下意识地只会记住重要的关键词。你会选择“amazing”和“perfectly balanced breakfast”这样的词汇，而不太关心“this”，“give”，“all”，“should”等字样。如果有人第二天问你评论内容，你可能不会一字不漏地记住它，而是记住了主要观点，比如“下次一定还来买”，一些次要内容自然会从记忆中逐渐消失。 (Nguyen 2018) 在这种情况下，你记住的这些词能判定了这个餐厅的好坏。这基本上就是LSTM或GRU的作用，它可以学习只保留相关信息来进行预测，并忘记不相关的数据。 (Nguyen 2018)

这算是 LSTM 最直接认识。而且也对词语给予了权重。所以除了并行以外，我觉得 transformer 并没有强在哪?

因此，在RNN中，梯度小幅更新的网络层会停止学习，这些通常是较早的层。由于这些层不学习，RNN无法记住它在较长序列中学习到的内容，因此它的记忆是短期的。 (Nguyen 2018)

所以 RNN 也可以叫做 STM

在处理时，它把先前的隐藏状态传递给序列的下一步，其中隐藏状态作为神经网络记忆，它包含相关网络已处理数据的信息。 (Nguyen 2018)

这些隐藏层，其实是矩阵，张量，通过梯度下降优化而成的。这样把每个训练集样本输入进来，进行“特征工程” (类似于 one-hot 后的词向量处理)，而且这个特征工程是通过损失函数进行控制得到的。

这样看其实比较理解了 RNN 的隐藏层传递。

下面来介绍RNN中每个cell单元是如何计算隐藏状态的。首先，将输入和先前隐藏状态组合成一个向量，向量中含有当前输入和先前输入的信息。这个向量再经过激活函数Tanh后，输出新的隐藏状态，或网络记忆。 (Nguyen 2018)

这就具备了 arima 的统计性质。

门结构中包含Sigmoid函数，这个激活函数与Tanh函数类似。但它的输出区间不是(-1, 1)，而是(0, 1)，这有助于更新或忘记数据，因为任何数字乘以0都为0，这部分信息会被遗忘。同样，任何数字乘以1都为相同值，这部分信息会完全保留。通过这样，网络能了解哪些数据不重要需要遗忘，哪些数字很重要需要保留。 (Nguyen 2018)

LSTM 中的遗忘门(t-h的信息)和输入门(t的信息)就是通过 Sigmoid 函数进行遗忘的。

遗忘门能决定应丢弃或保留哪些信息。来自先前隐藏状态的信息和当前输入的信息同时输入到Sigmoid函数，输出值处于0和1之间，越接近0意味着越应该忘记，越接近1意味着越应该保留。 (Nguyen 2018)

0和1的选取，取决于矩阵运算、张量运算，通过梯度下降优化而成的。这样把每个训练集样本输入进来，进行“特征工程” (类似于 one-hot 后的词向量处理)，而且这个特征工程是通过损失函数进行控制得到的。

这里总结下，遗忘门能决定需要保留先前步长中哪些相关信息，输入门决定在当前输入中哪些重要信息需要被添加，输出门决定了下一个隐藏状态。 (Nguyen 2018)

输出门 = 输入门 + 遗忘门

—- machine translation —-

3 SMT

\[\begin{align} \label{smt} \operatorname{argmax}_{y} P(y | x)=\operatorname{argmax}_{y} P(x | y) P(y) \tag{3.1}\\ \end{align}\]

Figure 3.1: SMT 也就是 Statistical Machine Translation，是最早期的。根据公式 (3.1) ，可以进行分工，也表达了 transition model 和 language model 的区别。左边更加关注学习翻译，右边更加关注如何写好英语。

Figure 3.2: 平行数据找好，就可以翻译了，因此不需要学习这门语言，即研究语言模型，单独训练翻译能力即可。

基本上是一个树结构，选择概率最大的情况。因此这种方法比较偏向传统的统计学，人工特征工作很重。

Figure 3.3: 基本上是一个树结构，选择概率最大的情况。因此这种方法比较偏向传统的统计学，人工特征工作很重。

\[\begin{align} \label{beamsize} \operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \tag{3.2} \end{align}\]

$不可能列举全部的y可能性，所以剔除概率小的情况，进行快速搜索。虽然不一定是最优，但是效率会提高很多。控制 beam size 一般 nmt 在5-10。因此如公式(3.2) 这会修改$P_{\mathrm{LM}}$的分布。$

Figure 3.4: 不可能列举全部的y可能性，所以剔除概率小的情况，进行快速搜索。虽然不一定是最优，但是效率会提高很多。控制 beam size 一般 nmt 在5-10。因此如公式(3.2) 这会修改$P_{\mathrm{LM}}$的分布。

\[\begin{align} \label{beamsizek2} \text { Beam size }=\mathrm{k}=2 . \text { Blue numbers }=\text { score }\left(y_{1}, \ldots, y_{t}\right)=\sum_{i=1}^{t} \log P_{\text {Lur }}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \tag{3.3} \end{align}\]

Figure 3.5: 如每一层保持最好的两个往下走，然后寻找到最优，参考公式(3.3)。

Figure 3.6: 开始和停止状态取决于和，这就是大部分 NLP 模型需要考虑的。

Figure 3.7: beam search 的问题，越长分数越低，返回去看树就发现的确是这样的，那么就求个平均，处理“越长分数越低”的问题。

4 NMT

\[\begin{align} \label{nmt} P(y | x)=P\left(y_{1} | x\right) P\left(y_{2} | y_{1}, x\right) P\left(y_{3} | y_{1}, y_{2}, x\right) \ldots P\left(y_{T} | y_{1}, \ldots, y_{T-1}, x\right) \tag{4.1} \end{align}\]

Figure 4.1: NMT 理解更简单，就一个单独的神经网络，而且就是两个 RNNS 串联起来，NMT 参考公式(4.1)。这里相比较于 smt 是系统性学习，并没有分工，分工局部最优但是不是全局最优。

Figure 4.2: 这里LSTM可以单向，双向，多层都可以。这里是包含了词向量。argmax 这是 smt 的假设。

\[\begin{align} \label{nmtlossfunc} J=\frac{1}{T} \sum_{t=1}^{T} I_{t} \tag{4.2}\\ \end{align}\]

Figure 4.3: 根据公式(4.2)损失函数定义。

(ref:20200401010346-plot) NMT 相比较于 SMT，只需要训练一个神经网络，完成端到端服务，并且减少了很多人工特征的过程，并且可以制造任何语言间的通法。因此可读性差，出现了bug，难得以最小成本解决，基本上需要重新训练模型了，难以控制，比如出现敏感词等。

Figure 4.4: (ref:20200401010346-plot)

Figure 4.5: BLEU 同音于 BLUE，衡量语言模型是否好不好，并且依赖人工的如果机器和人工翻译都有效，只是不同，往往拿到一个不好的成绩。

Figure 4.6: NMT 发展时间好快，基本节点就是2014、2016年两个节点，2016年在BLEU上成功超越SMT，投入到实际生产中。

Figure 4.7: Domain Mismatch 很好理解，比如训练的时候用维基百科，但是预测微博，这不恰当。但是维基百科还是效率很高的文本，是志愿者维护的，多语言训练文本。

Figure 4.8: BIAS 是 NMT、SMT 都具备的问题，比如他、她的预测，存在 bias。

Figure 4.9: 这是一个很好的例子，当模型不可理解时，容易出现的锅，这是 NMT 相对于 SMT 不可控制的一点。这里的问题就是当训练集比如圣经等，然后预测日常对话，就会出现这样的状况。

5 Attention

5.1 直接理解

I want to eat apple. (青青山螺应如是 2019a, 2019b)

我想吃苹果. (青青山螺应如是 2019a, 2019b)

按照分心模型的逻辑，生成吃和苹果这两个单词，源语句中的eat影响力是一样的，这显然不科学，eat肯定对生成吃这个字影响更多一些，apple则弱些，他们的注意力分配应该是不一样的。 (青青山螺应如是 2019a, 2019b)

这是注意力模型的直观理解。

Attention给模型赋予了区分辨别的能力，例如，在机器翻译、语音识别应用中，为句子中的每个词赋予不同的权重，使神经网络模型的学习变得更加灵活（soft），同时Attention本身可以做为一种对齐关系，解释翻译输入/输出句子之间的对齐关系，解释模型到底学到了什么知识，为我们打开深度学习的黑箱，提供了一个窗口。 (青青山螺应如是 2019a, 2019b)

Figure 5.1: soft 指的是每个词的权重。

颜色越黄绿，代表英语单词对应德语单词分配的注意力系数越大。 (青青山螺应如是 2019a, 2019b)

Figure 5.2: 这样就把这种赋予的权重解释很清楚了。这个对齐关系同时也可以可视化。

5.2 快速解释

Figure 5.3: 在 encoder 部分，这个 RNN 提取信息的能力不够，因此在 NMT 的基础上，2019年提出了 attention 机制，attention 机制关注句子的某一部分，会关注 RNN 里面句子成分中的概率分布。

Figure 5.4: 这是一个很好的例子，比如 decoder 中，第一个词为 she 还是 he，在 encoder 部分进行概率分布查看。

Figure 5.5: 然后这个结果通过 concatenate 进入 decoder，然后进行损失函数的梯度下降。

Figure 5.6: decoder 的每个词 attention 是不一样的，因此核心理解 attention 做了一步 dot product 影响了梯度下降的速度，增加了一个隐藏向量。

We have encoder hidden states $h_{1}, \ldots, h_{N} \in \mathbb{R}^{h}$ On timestep $t$, we have decoder hidden state $s_{t} \in \mathbb{R}^{h}$ We get the attention scores $e^{t}$ for this step: \[ e^{t}=\left[s_{t}^{T} h_{1}, \ldots, s_{t}^{T} h_{N}\right] \in \mathbb{R}^{N} \tag{5.1} \] We take softmax to get the attention distribution $\alpha^{t}$ for this step (this is a probability distribution and sums to 1 ) \[ \alpha^{t}=\operatorname{softmax}\left(e^{t}\right) \in \mathbb{R}^{N} \tag{5.2} \] We use $\alpha^{t}$ to take a weighted sum of the encoder hidden states to get the attention output $\boldsymbol{a}_{t}$ \[ \boldsymbol{a}_{t}=\sum_{i=1}^{N} \alpha_{i}^{t} \boldsymbol{h}_{i} \in \mathbb{R}^{h} \tag{5.3} \]

这里也可以说每个 decoder 的词拥有 encoder 每个词的分布，这个概率分布体现了重要性。这样理解很多，就是在 encoder 和 decoder 加了一个词的分布，然后点乘到 encoder 上。公式 (5.1) 决定了注意力层的形状和 encoder 同尺寸，公式 (5.2) 决定了标准化后，这是一个求和为1的分布，公式 (eq:attentionoutput) 表示注意力层的尺寸可以和 encoder 层直接 concatenate。

Figure 5.7: 这里也可以说每个 decoder 的词拥有 encoder 每个词的分布，这个概率分布体现了重要性。这样理解很多，就是在 encoder 和 decoder 加了一个词的分布，然后点乘到 encoder 上。公式 (5.1) 决定了注意力层的形状和 encoder 同尺寸，公式 (5.2) 决定了标准化后，这是一个求和为1的分布，公式 (eq:attentionoutput) 表示注意力层的尺寸可以和 encoder 层直接 concatenate。

Figure 5.8: 加入了概率分布后，的确存在影响梯度下降，或者说更快有引导的去梯度下降。并且这个注意力词的分布，正好可以用来作为可视化，加强理解了。

Figure 5.9: attention 的 general 定义，一个 value 根据 query 的分布。类似于lstm 新加入的遗忘门，进行了信息筛选。

Multiplicative attention: $e_{i}=s^{T} W h_{i} \in \mathbb{R}$ Where $\boldsymbol{W} \in \mathbb{R}^{d_{2} \times d_{1}}$ is a weight matrix

Additive attention: $e_{i}=\boldsymbol{v}^{T} \tanh \left(\boldsymbol{W}_{1} \boldsymbol{h}_{i}+\boldsymbol{W}_{2} \boldsymbol{s}\right) \in \mathbb{R}$ Where $\boldsymbol{W}_{1} \in \mathbb{R}^{d_{3} \times d_{1}}, \boldsymbol{W}_{2} \in \mathbb{R}^{d_{3} \times d_{2}}$ are weight matrices and $\boldsymbol{v} \in \mathbb{R}^{d_{3}}$ is a weight vector. $d_{3}$ (the attention dimensionality) is a hyperparameter

Figure 5.10: 由于注意力机制的核心是加入的词的概率分布，因此有各种变形，如加入一个 weight matrix 来确定参数和权重。

5.3 文献阅读

In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

segment 本身就是一个很损失信息的方式。

Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

这里的 hard segment 和 (soft-)alignments 对应。

5.3.1 INTRODUCTION

A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al. (2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.

训练文本长度会限制在测试集上的表现。

5.3.2 BACKGROUND

This neural machine translation approach typ-ically consists of two components, the first of which encodes a source sentence x and the second decodes to a target sentence y.

x -> encode -> decode -> y

In the Encoder-Decoder framework, an encoder reads the input sentence, a sequence of vectors $\mathbf{x}=\left(x_{1}, \cdots, x_{T_{x}}\right),$ into a vector $c_{\mathcal{U}}^{2}$ The most common approach is to use an $\mathrm{RNN}$ such that \[ h_{t}=f\left(x_{t}, h_{t-1}\right) \] and \[ c=q\left(\left\{h_{1}, \cdots, h_{T_{x}}\right\}\right) \] where $h_{t} \in \mathbb{R}^{n}$ is a hidden state at time $t,$ and $c$ is a vector generated from the sequence of the hidden states. $f$ and $q$ are some nonlinear functions. Sutskever $e t$ al. (2014) used an $L S T M$ as $f$ and $q\left(\left\{h_{1}, \cdots, h_{T}\right\}\right)=h_{T},$ for instance.

encoder 就是一个 RNN，然后没有介入一个全连接层加上 softmax 做多分类。

The decoder is often trained to predict the next word $y_{t^{\prime}}$ given the context vector $c$ and all the previously predicted words $\left\{y_{1}, \cdots, y_{t^{\prime}-1}\right\} .$ In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals: \[ p(\mathbf{y})=\prod_{t=1}^{T} p\left(y_{t} \mid\left\{y_{1}, \cdots, y_{t-1}\right\}, c\right) \] where $\mathbf{y}=\left(y_{1}, \cdots, y_{T_{y}}\right) .$ With an RNN, each conditional probability is modeled as \[ p\left(y_{t} \mid\left\{y_{1}, \cdots, y_{t-1}\right\}, c\right)=g\left(y_{t-1}, s_{t}, c\right) \tag{5.4} \] where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_{t},$ and $s_{t}$ is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an $\mathrm{RNN}$ and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013 ).

decoder 中，输出为$y_{t}$，输入为三部分滞后期$y_{t-1}$、RNN 的隐藏层$s_t$、前面的 $c$。

5.3.3 ALIGNMENT

In a new model architecture, we define each conditional probability in Eq. \[ p\left(y_{i} \mid y_{1}, \ldots, y_{i-1}, \mathbf{x}\right)=g\left(y_{i-1}, s_{i}, c_{i}\right) \] where $s_{i}$ is an RNN hidden state for time $i,$ computed by \[ s_{i}=f\left(s_{i-1}, y_{i-1}, c_{i}\right) \] It should be noted that unlike the existing encoder-decoder approach (see Eq. (2) ), here the probability is conditioned on a distinct context vector $c_{i}$ for each target word $y_{i}$

公式 (5.4) 中 c 是恒定的，by i 来说，每个 y 都是一样的。

The context vector $c_{i}$ depends on a sequence of annotations $\left(h_{1}, \cdots, h_{T_{x}}\right)$ to which an encoder maps the input sentence. Each annotation $h_{i}$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$ -th word of the input sequence. We explain in detail how the annotations are computed in the next section.

c 是来源于 encoder 的，就是 attention 机制的体现。

The context vector $c_{i}$ is, then, computed as a weighted sum of these annotations $h_{i}:$ \[ c_{i}=\sum_{j=1}^{T_{x}} \alpha_{i j} h_{j} \]

这部分就是非常熟悉的内容了。$c_{i}$就是度量了 attention。

The weight $\alpha_{i j}$ of each annotation $h_{j}$ is computed by \[ \alpha_{i j}=\frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{T_{x}} \exp \left(e_{i k}\right)} \tag{5.5} \] where \[ e_{i j}=a\left(s_{i-1}, h_{j}\right) \tag{5.6} \] is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_{i},$ Eq. (4) ) and the $j-$ th annotation $h_{j}$ of the input sentence. We parametrize the alignment model $a$ as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, the alignment is not considered to be a latent variable. Instead, the alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through. This gradient can be used to train the alignment model as well as the whole translation model jointly.

注意是 $s_{t-1}$ 而非 $s_t$，因为还是预测 $s_t$，所以技术 trick 用 $s_{t-1}$。

We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments. Let $\alpha_{i j}$ be a probability that the target word $y_{i}$ is aligned to, or translated from, a source word $x_{j} .$ Then, the $i$ -th context vector $c_{i}$ is the expected annotation over all the annotations with probabilities $\alpha_{i j}$ The probability $\alpha_{i j},$ or its associated energy $e_{i j},$ reflects the importance of the annotation $h_{j}$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_{i}$ and generating $y_{i} .$

associated energy $e_{i j},$ 表达很好，这是一个相似度函数。

Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixedlength vector.

由于有了注意力机制，softly align 在 y 某些位置上。

相似度上，是词与词，还是位置与位置。
位置与位置，还需要再读一下。

为什么 $c_i$ 不一样，因为要形成一个矩阵，每个$s_{t-1}$都有$1,...,T_x$个$e$和$alpha$，所以聚合的 $c_i$，就是 by i 的了。对于 $y_i$ 来说，要知道哪一个$h_j$最好，最应该 attend to，哪一个 $alpha$ 更大，就会更加关注。

With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.

attention 机制并没有对文本切割，而是用 attention 让 y 关注某一些 x，这样达到 soft 切，实现 annotation。

关注下矩阵图，热力图

这样就可以理解为什么是 JOINTLY LEARNING TO ALIGN AND TRANSLATE。

公式(5.6)可以看到下标为 i 和 j。 $a$是一个两两计算的匹配函数。其中

$s_{i-1}$是输出层$y_{t-1}$的隐藏层
$h_{j}$是输入层

这个遍历完以后，形成一个 NxN 的矩阵。

$这里可以发现每个 y 端的 RNN 神经元$s_t$都含有一个$c$。之前都是连接一个恒定的 $c$。$

Figure 5.11: 这里可以发现每个 y 端的 RNN 神经元$s_t$都含有一个$c$。之前都是连接一个恒定的 $c$。

下面一部分是理解 attention 的核心。

~~这里作者没有把 attention 层和输入层的隐藏层 concatenate。~~ 隐含了，见公式 (5.7)。

For more details on the architectures of the models and training procedure used in the experiments, see Appendices A and B.

具体的模型结果，参考附录还是有一些细节值得读一下。

Figure 5.12: *search 指的是 attention 的模型，在随着句子长度提高，效果越来越好。这个也是作者提到的解决的问题。

One of the motivations behind the proposed approach was the use of a fixed-length context vector in the basic encoder–decoder approach. We conjectured that this limitation may make the basic encoder–decoder approach to underperform with long sentences. In Fig. 2, we see that the performance of RNNencdec dramatically drops as the length of the sentences increases. On the other hand, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch- 50, especially, shows no performance deterioration even with sentences of length 50 or more. This superiority of the proposed model over the basic encoder–decoder is further confirmed by the fact that the RNNsearch-30 even outperforms RNNencdec-50 (see Table 1).

attention 的加入，使得长句训练后的 BY 句子长度的效果没有得到衰减，如图 5.12。

5.3.4 RESULTS

Figure 5.13: 根据公式 (5.5) 这里是 alpha 的体现。

We can see from the alignments in Fig. 3 , that the alignment of words between English and French is largely monotonic. We see strong weights along the diagonal of each matrix. However, we also observe a number of non-trivial, non-monotonic alignments. Adjectives and nouns are typically ordered differently between French and English, and we see an example in Fig. [3] (a). From this figure, we see that the model correctly translates a phrase [European Economic Area] into [zone économique européen]. The RNNsearch was able to correctly align [zone] with [Area], jumping over the two words ([European] and [Economic]), and then looked one word back at a time to complete the whole phrase [zone économique européenne].

基本上高亮都在对角线上，这个没有问题，但是也可以注意到，area 和 zone，不在一个位置，但是 alpha 也被学习到了，达成了 soft alignment 的效果。

An additional benefit of the soft alignment is that it naturally deals with source and target phrases of different lengths, without requiring a counter-intuitive way of mapping some words to or from nowhere ([NULL]) (see, e.g., Chapters 4 and 5 of Koehn, 2010)

并且也不要求等长了，但是没有在作者的实验中看到。

As clearly visible from Fig. 2 the proposed model (RNNsearch) is much better than the conventional model (RNNencdec) at translating long sentences. This is likely due to the fact that the RNNsearch does not require encoding a long sentence into a fixed-length vector perfectly, but only accurately encoding the parts of the input sentence that surround a particular word.

如图 5.12，RNNsearch 更好的学习了长句子，作者的猜想是 RNNsearch 并不需要让一个长句子完美的放到一个固定长度的向量中，可以只关注某几个 attention 的地方。

Although the above approaches were shown to improve the translation performance over the stateof-the-art machine translation systems, we are more interested in a more ambitious objective of designing a completely new translation system based on neural networks. The neural machine translation approach we consider in this paper is therefore a radical departure from these earlier works. Rather than using a neural network as a part of the existing system, our model works on its own and generates a translation from a source sentence directly.

作者其实并非是优化神经网络，而是直接把 input 和 output 做了一个 mapping，以提高预测能力。

修改 idea，以 alignment 作为文稿的 idea

5.3.5 CONCLUSION

理解清楚 attention 的机制了，严格来说是 alignment，更好的表达，其实就是 input 和 output 的相似度矩阵。

5.3.6 MODEL ARCHITECTURE

The proposed scheme in Section 3 is a general framework where one can freely define, for instance, the activation functions f of recurrent neural networks (RNN) and the alignment model a.

alpha 定义为 alignment model。 alignment model 从 RNN 开始讲起。

5.3.6.1 RNN

这里定义的 RNN 不是 LTSM 而是改进版本的 LSTM，没有 input gate 和 forget gate，只有 update gate。

Variants on Long Short Term Memory What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above.

LSTM 有很多变形。

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

其实就是每个门，再加入 $C_{t-1}$。从方程的右边来看，没有差异了，入参一样，只是权重不共享。

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

这里是把遗忘门和输入门合并了。

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

GRU 把遗忘门和输入门合并成一个东西，然后把 $C_{t-1}$和$h_{t-1}$合并。

update gate 这里提到了，需要使用，这里的 LSTM 只有两个门，如公式 (5.8) 并且保持了以上的假设，入参一致。

公式 (5.7) 中的$c_{i}$代表 alignment model 的位置。

The new state $s_{i}$ of the RNN employing $n$ gated hidden units ${ }^{8}$ is computed by \[ s_{i}=f\left(s_{i-1}, y_{i-1}, c_{i}\right)=\left(1-z_{i}\right) \circ s_{i-1}+z_{i} \circ \tilde{s}_{i} \] where o is an element-wise multiplication, and $z_{i}$ is the output of the update gates (see below). The proposed updated state $\tilde{s}_{i}$ is computed by \[ \tilde{s}_{i}=\tanh \left(W e\left(y_{i-1}\right)+U\left[r_{i} \circ s_{i-1}\right]+C c_{i}\right) \tag{5.7} \] where $e\left(y_{i-1}\right) \in \mathbb{R}^{m}$ is an $m$ -dimensional embedding of a word $y_{i-1},$ and $r_{i}$ is the output of the reset gates (see below). When $y_{i}$ is represented as a 1 -of- $K$ vector, $e\left(y_{i}\right)$ is simply a column of an embedding matrix $E \in \mathbb{R}^{m \times K}$. Whenever possible, we omit bias terms to make the equations less cluttered. The update gates $z_{i}$ allow each hidden unit to maintain its previous activation, and the reset gates $r_{i}$ control how much and what information from the previous state should be reset. We compute them by \[ \begin{array}{l} z_{i}=\sigma\left(W_{z} e\left(y_{i-1}\right)+U_{z} s_{i-1}+C_{z} c_{i}\right) \\ r_{i}=\sigma\left(W_{r} e\left(y_{i-1}\right)+U_{r} s_{i-1}+C_{r} c_{i}\right) \end{array} \tag{5.8} \] where $\sigma(\cdot)$ is a logistic sigmoid function.

如公式 (5.6)，右边的参数都有。

The alignment model should be designed considering that the model needs to be evaluated $T_{x} \times T_{y}$ times for each sentence pair of lengths $T_{x}$ and $T_{y} .$ In order to reduce computation, we use a singlelayer multilayer perceptron such that \[ a\left(s_{i-1}, h_{j}\right)=v_{a}^{\top} \tanh \left(W_{a} s_{i-1}+U_{a} h_{j}\right) \tag{5.6} \] where $W_{a} \in \mathbb{R}^{n \times n}, U_{a} \in \mathbb{R}^{n \times 2 n}$ and $v_{a} \in \mathbb{R}^{n}$ are the weight matrices. since $U_{a} h_{j}$ does not depend on $i,$ we can pre-compute it in advance to minimize the computational cost.

为什么这里是 2n
A.2 DETAILED DESCRIPTION OF THE MODEL 值得了解下，比如 i = 0 时。

The model takes a source sentence of $1-$ of-K coded word vectors as input \[ \mathbf{x}=\left(x_{1}, \ldots, x_{T_{x}}\right), x_{i} \in \mathbb{R}^{K_{x}} \] and outputs a translated sentence of 1 -of-K coded word vectors \[ \mathbf{y}=\left(y_{1}, \ldots, y_{T_{y}}\right), y_{i} \in \mathbb{R}^{K_{y}} \] where $K_{x}$ and $K_{y}$ are the vocabulary sizes of source and target languages, respectively. $T_{x}$ and $T_{y}$ respectively denote the lengths of source and target sentences.

所以输入和输出还是需要 padding，fixed length vector 还是使用的，只是 align softly 而已。

First, the forward states of the bidirectional recurrent neural network (BiRNN) are computed: \[ \vec{h}_{i}=\left\{\begin{array}{ll} \left(1-\vec{z}_{i}\right) \circ \vec{h}_{i-1}+\vec{z}_{i} \circ \underline{h}_{i}, \text { if } i>0 \\ 0 & , \text { if } i=0 \end{array}\right. \] where \[ \begin{array}{l} \overrightarrow{\underline{h}}_{i}=\tanh \left(\vec{W} \bar{E} x_{i}+\vec{U}\left[\vec{r}_{i} \circ \vec{h}_{i-1}\right]\right) \\ \vec{z}_{i}=\sigma\left(\vec{W}_{z} \bar{E} x_{i}+\vec{U}_{z} \vec{h}_{i-1}\right) \\ \vec{r}_{i}=\sigma\left(\vec{W}_{r} \bar{E} x_{i}+\vec{U}_{r} \vec{h}_{i-1}\right) \end{array} \] $\bar{E} \in \mathbb{R}^{m \times K_{x}}$ is the word embedding matrix. $\vec{W}, \vec{W}_{z}, \vec{W}_{r} \in \mathbb{R}^{n \times m}, \vec{U}, \vec{U}_{z}, \vec{U}_{r} \in \mathbb{R}^{n \times n}$ are weight matrices. $m$ and $n$ are the word embedding dimensionality and the number of hidden units, respectively. $\sigma(\cdot)$ is as usual a logistic sigmoid function. The backward states $\left(\bar{h}_{1}, \cdots, \overleftarrow{h}_{T_{x}}\right)$ are computed similarly. We share the word embedding matrix $\bar{E}$ between the forward and backward RNNs, unlike the weight matrices. We concatenate the forward and backward states to to obtain the annotations $\left(h_{1}, h_{2}, \cdots, h_{T_{x}}\right)$ where \[ h_{i}=\left[\begin{array}{c} \vec{h}_{i} \\ \overleftarrow{h}_{i} \end{array}\right] \]

这一部分需要掌握一点 LSTM 的知识才能够理解这里门的设置。

decoder 的模型类似，这里不多做解释。

The hidden state $s_{i}$ of the decoder given the annotations from the encoder is computed by \[ s_{i}=\left(1-z_{i}\right) \circ s_{i-1}+z_{i} \circ \tilde{s}_{i} \] where \[ \begin{array}{l} \tilde{s}_{i}=\tanh \left(W E y_{i-1}+U\left[r_{i} \circ s_{i-1}\right]+C c_{i}\right) \\ z_{i}=\sigma\left(W_{z} E y_{i-1}+U_{z} s_{i-1}+C_{z} c_{i}\right) \\ r_{i}=\sigma\left(W_{r} E y_{i-1}+U_{r} s_{i-1}+C_{r} c_{i}\right) \end{array} \] $E$ is the word embedding matrix for the target language. $W, W_{z}, W_{r} \in \mathbb{R}^{n \times m}, U, U_{z}, U_{r} \in \mathbb{R}^{n \times n}$ and $C, C_{z}, C_{r} \in \mathbb{R}^{n \times 2 n}$ are weights. Again, $m$ and $n$ are the word embedding dimensionality and the number of hidden units, respectively. The initial hidden state $s_{0}$ is computed by $s_{0}=$ $\tanh \left(W_{s} \overleftarrow{h}_{1}\right),$ where $W_{s} \in \mathbb{R}^{n \times n}$ The context vector $c_{i}$ are recomputed at each step by the alignment model: \[ c_{i}=\sum_{i=1}^{T_{x}} \alpha_{i j} h_{j} \] where \[ \begin{aligned} \alpha_{i j} &=\frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{T_{x}} \exp \left(e_{i k}\right)} \\ e_{i j} &=v_{a}^{\top} \tanh \left(W_{a} s_{i-1}+U_{a} h_{j}\right) \end{aligned} \] and $h_{j}$ is the $j$ -th annotation in the source sentence (see Eq. (7) ). $v_{a} \in \mathbb{R}^{n^{\prime}}, W_{a} \in \mathbb{R}^{n^{\prime} \times n}$ and $U_{a} \in \mathbb{R}^{n^{\prime} \times 2 n}$ are weight matrices. Note that the model becomes RNN Encoder-Decoder $\sqrt{\mathrm{Cho}}$ et al. $, 2014 \mathrm{a}),$ if we fix $c_{i}$ to $\vec{h}_{T_{x}}$ With the decoder state $s_{i-1},$ the context $c_{i}$ and the last generated word $y_{i-1},$ we define the probability of a target word $y_{i}$ as \[ p\left(y_{i} \mid s_{i}, y_{i-1}, c_{i}\right) \propto \exp \left(y_{i}^{\top} W_{o} t_{i}\right) \] where \[ t_{i}=\left[\max \left\{\tilde{t}_{i, 2 j-1}, \tilde{t}_{i, 2 j}\right\}\right]_{j=1, \ldots, l}^{\top} \] and $\tilde{t}_{i, k}$ is the $k$ -th element of a vector $\tilde{t}_{i}$ which is computed by \[ \tilde{t}_{i}=U_{o} s_{i-1}+V_{o} E y_{i-1}+C_{o} c_{i} \] $W_{o} \in \mathbb{R}^{K_{y} \times l}, U_{o} \in \mathbb{R}^{2 l \times n}, V_{o} \in \mathbb{R}^{2 l \times m}$ and $C_{o} \in \mathbb{R}^{2 l \times 2 n}$ are weight matrices. This can be under- stood as having a deep output (Pascanu et al., 2014 ) with a single maxout hidden layer (Goodfellow et al.,2013).

6 BERT

建议使用服务器安装 Python 和作为训练环境，没有安装问题。服务器现在是跑不动的，我使用了 Google CoLab 跑得动，大家可以都尝试，毕竟免费。

bert_pretrained_model = MyNoTupleModel.from_pretrained(bert_model, num_labels=2)

100%|██████████| 520/520 [00:00<00:00, 239727.20B/s]
  0%|          | 51200/411577189 [00:30<29:09:02, 3921.42B/s]

需要一天训练，因此建议使用 CoLab 完成。

image

需要15分钟，完成一次迭代，耗时可以接受。

参考王树义 (2019) 使用 fast.ai 完成 Bert 的预训练模型微调。我是在 CoLab 的环境下执行，没有安装问题，阿里云服务器也进行了测试。

conda install -c pytorch pytorch-nightly-cpu 
conda install -c fastai torchvision-nightly-cpu 
conda install -c fastai fastai

image

一个 epoch 下来要15分钟。

image

训练完需要半小时。

7 TextCNN

7.1 Intro

CNN(Convolutional Neural Network) 卷积神经网络在图像处理方面应用很多，TextCNN 模型创新性的将 CNN 结构应用在了 NLP 领域，为文本分类提供了新的思路，TextCNN 解决了传统方法分词处理和词向量表示两个关键问题，其贡献主要有以下几点(杨照璐 and 郑世强 2020)：

避免分词，以字符为单位的文本向量表示
CNN 能捕捉局部区域的词序及语义信息，所表达的特征更加丰富
采用不同尺寸的卷积核，可以提取到 n-gram 的特征
卷积结构运算速度快，模型响应时长控制在 50ms 以下

参考 Cecchini (2019 Chapter 3) 在 Keras 的实现，

model.add(Embedding(vocabulary_size, wordvec_dim, ...))
model.add(Conv1D(num_filters=32, kernel_size=3, padding='same'))
model.add(MaxPooling1D(pool_size=2))

然后，查看具体的介绍，操作过程和图像类似。

7.2 Preview Review

7.2.1 Introduction

We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks.

CNN 一定需要进行调参的。

which include sentiment analysis and question classification.

包含了情感分析的情况。

In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model.

结构清晰。

We initially keep the word vectors static and learn only the other parameters of the model.

词向量在模型中不训练，fixed。

7.2.2 Model

Figure 7.1: 利用词向量转写为一个 numerical representation。文本转写为一个图片格式，channel为1和，长为句子长度，宽为词向量长度。因此这里就变成了一个图片认为，这是理解 Kim (2014) idea 的核心。

The model architecture, shown in figure $1,$ is a slight variant of the CNN architecture of Collobert et al. $(2011) .$ Let $\mathbf{x}_{i} \in \mathbb{R}^{k}$ be the $k$ -dimensional word vector corresponding to the $i$ -th word in the sentence. A sentence of length $n$ (padded where necessary) is represented as \[ \mathbf{x}_{1: n}=\mathbf{x}_{1} \oplus \mathbf{x}_{2} \oplus \ldots \oplus \mathbf{x}_{n} \] where $\oplus$ is the concatenation operator. In general, let $\mathbf{x}_{i: i+j}$ refer to the concatenation of words $\mathbf{x}_{i}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_{i+j}$.

A convolution operation in- volves a filter $\mathbf{w} \in \mathbb{R}^{h k},$ which is applied to a window of $h$ words to produce a new feature. For example, a feature $c_{i}$ is generated from a window of words $\mathbf{x}_{i: i+h-1}$ by \[ c_{i}=f\left(\mathbf{w} \cdot \mathbf{x}_{i: i+h-1}+b\right) \] Here $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence $\left\{\mathbf{x}_{1: h}, \mathbf{x}_{2: h+1}, \ldots, \mathbf{x}_{n-h+1: n}\right\}$ to produce a feature map \[ \mathbf{c}=\left[c_{1}, c_{2}, \ldots, c_{n-h+1}\right] \] with $\mathbf{c} \in \mathbb{R}^{n-h+1}$. We then apply a max-overtime pooling operation (Collobert et al., 2011 ) over the feature map and take the maximum value $\hat{c}=\max \{\mathbf{c}\}$ as the feature corresponding to this particular filter. The idea is to capture the most important feature-one with the highest value-for each feature map. This pooling scheme naturally deals with variable sentence lengths.

h 是 window。

经过一层 filter，产生一个 feature map (类似于一张图)，多层 filter，产生多个 feature maps。

然后 filter 是 max pooling 处理的，Kim (2014) 认为局部最大值反映了最重要的特征。

We have described the process by which one feature is extracted from one filter. The model uses multiple filters (with varying window sizes) to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.

这个就是倒数第二层的结构了，后面就是一个全连接的 softmax 层，直接进行多分类任务了。

In one of the model variants, we experiment with having two ‘channels’ of word vectors—one that is kept static throughout training and one that is fine-tuned via backpropagation (section 3.2).2 In the multichannel architecture, illustrated in figure 1, each filter is applied to both channels and the results are added to calculate ci in equation (2). The model is otherwise equivalent to the single channel architecture.

从图片角度理解，Kim (2014) 用’词向量’构造了一个图片，Kim (2014) 让图像有两个原色，

词向量不训练的
词向量训练的

这是一个小的 trick。

7.2.3 Results and Discussion

Results of our models against other methods are listed in table 2. Our baseline model with all randomly initialized words (CNN-rand) does not perform well on its own. While we had expected performance gains through the use of pre-trained vectors, we were surprised at the magnitude of the gains. Even a simple model with static vectors (CNN-static) performs remarkably well, giving competitive results against the more sophisticated deep learning models that utilize complex pooling schemes (Kalchbrenner et al., 2014) or require parse trees to be computed beforehand (Socher et al., 2013). These results suggest that the pretrained vectors are good, ‘universal’ feature extractors and can be utilized across datasets. Finetuning the pre-trained vectors for each task gives still further improvements (CNN-non-static)

Figure 7.2: 显然随机给词向量的情况是不好的结果，加入词向量后无论是否可训练，结果都表现不错。

7.2.4 Conclusion

7.3 Coding

附录

参考文献

Cecchini, David. 2019. “Recurrent Neural Networks for Language Modeling in Python.” DataCamp. 2019. https://learn.datacamp.com/courses/recurrent-neural-networks-for-language-modeling-in-python.

Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence Classification.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–51. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1181.

Nguyen, Michael. 2018. “超生动图解LSTM和GPU：拯救循环神经网络的记忆障碍就靠它们了！.” 量子位. 2018. https://mp.weixin.qq.com/s/xWZwWHsUMQwIue9wlQgg3A.

机器学习算法与自然语言处理. 2019. “NLP领域预训练模型的现状及分析.” 机器学习算法与自然语言处理. 2019. https://mp.weixin.qq.com/s/vFsJE81Rs8C1zKoNv3K-bA.

杨照璐, and 郑世强. 2020. “文本反垃圾在花椒直播中的应用概述.” TensorFlow. 2020. https://mp.weixin.qq.com/s/PFUjEyf3lZY_ZYL2BAxO0A.

王树义. 2019. “如何在 Fast.ai 用 Bert 做中文文本分类？.” 玉树芝兰. 2019. https://mp.weixin.qq.com/s/5pp-4cL6nfpr0J0ffhy9Vg.

青青山螺应如是. 2019a. “【NLP】机器翻译和注意力机制(1)——Mt基础原理.” 青影三弄. 2019. https://mp.weixin.qq.com/s/N-Fc48BZeowp-tgQIyf_5g.

———. 2019b. “【NLP】神经网络机器翻译和注意力机制(2)——pytorch实现MT+AM代码详解.” 青影三弄. 2019. https://mp.weixin.qq.com/s/mlpcl7kIpVSKhV3L0Ng5yQ.

语言模型学习笔记

语言模型学习笔记

1 综述

2 LSTM 直观理解

3 SMT

4 NMT

5 Attention

5.1 直接理解

5.2 快速解释

5.3 文献阅读

5.3.1 INTRODUCTION

5.3.2 BACKGROUND

5.3.3 ALIGNMENT

5.3.4 RESULTS

5.3.5 CONCLUSION

5.3.6 MODEL ARCHITECTURE

5.3.6.1 RNN

6 BERT

7 TextCNN

7.1 Intro

7.2 Preview Review

7.2.1 Introduction

7.2.2 Model

7.2.3 Results and Discussion

7.2.4 Conclusion

7.3 Coding

附录

参考文献

语言模型 学习笔记

1 综述

2 LSTM 直观理解

3 SMT

4 NMT

5 Attention

5.1 直接理解

5.2 快速解释

5.3 文献阅读

5.3.1 INTRODUCTION

5.3.2 BACKGROUND

5.3.3 ALIGNMENT

5.3.4 RESULTS

5.3.5 CONCLUSION

5.3.6 MODEL ARCHITECTURE

5.3.6.1 RNN

6 BERT

7 TextCNN

7.1 Intro

7.2 Preview Review

7.2.1 Introduction

7.2.2 Model

7.2.3 Results and Discussion

7.2.4 Conclusion

7.3 Coding

附录

参考文献

语言模型学习笔记