Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs (Study Notes)

Introduction: What Are RNNs?

1. The core idea of RNNs is to make use of sequential information.
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence you better know which words came before it.
2. An RNN's output depends on previous inputs, which can also be understood as a kind of memory.
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far.
3. In practice, an RNN's memory does not reach very far back.
In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later).
A Typical RNN Model
[Figure: a recurrent neural network unrolled into a full network]
The above diagram shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. The formulas that govern the computation happening in an RNN are as follows:

x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.
s_t is the hidden state at time step t. It’s the “memory” of the network. s_t is calculated based on the previous hidden state and the input at the current step: s_t=f(Ux_t + Ws_{t-1}). The function f usually is a nonlinearity such as tanh or ReLU. s_{-1}, which is required to calculate the first hidden state, is typically initialized to all zeroes.
o_t is the output at step t. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary. o_t = \mathrm{softmax}(Vs_t).
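The forward pass defined by these equations can be sketched in a few lines of NumPy. All dimensions, weight initializations, and the toy "sentence" below are made up for illustration; they are not from the tutorial:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Run a vanilla RNN over a sequence of input vectors.

    Implements s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t),
    returning the hidden states and outputs at every step.
    """
    s_prev = np.zeros(W.shape[0])  # s_{-1} initialized to all zeros
    states, outputs = [], []
    for x_t in xs:
        s_t = np.tanh(U @ x_t + W @ s_prev)  # s_t = f(U x_t + W s_{t-1})
        o_t = softmax(V @ s_t)               # o_t = softmax(V s_t)
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t
    return states, outputs

# Toy dimensions: vocabulary of 8 "words", hidden state of size 4.
rng = np.random.default_rng(0)
vocab, hidden = 8, 4
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab, hidden))

sentence = [np.eye(vocab)[i] for i in [3, 1, 4]]  # three one-hot word vectors
states, outputs = rnn_forward(sentence, U, W, V)
```

Each o_t is a valid probability distribution over the vocabulary, so any step's output can be read as a next-word prediction.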

A few points to note about this diagram:
1. s_t can be thought of as the memory unit of the network.

2. The parameters U, V, and W are shared across time steps.
Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters (U, V, W above) across all steps. This reflects the fact that we are performing the same task at each step (the RNN treats every element of the sequence alike, handling each one as the same task), just with different inputs. This greatly reduces the total number of parameters we need to learn.
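A back-of-the-envelope count makes the savings concrete. The dimensions below are made up for illustration; the comparison is against a hypothetical unrolled network that uses separate weights at each layer:

```python
# Illustrative dimensions: 8000-word vocabulary, hidden state of 100,
# sequences of 50 steps.
vocab, hidden, seq_len = 8000, 100, 50

# Shared-parameter RNN: one U (hidden x vocab), one W (hidden x hidden),
# one V (vocab x hidden), regardless of sequence length.
rnn_params = hidden * vocab + hidden * hidden + vocab * hidden

# Hypothetical unrolled network with *separate* weights at each of the
# seq_len layers: the parameter count grows linearly with sequence length.
unshared_params = seq_len * rnn_params

print(rnn_params)       # 1610000 -- independent of seq_len
print(unshared_params)  # 80500000 -- 50x larger
```

Because the count is independent of sequence length, the same learned weights apply to sentences of any length.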

3. Every output of the RNN can serve as a prediction in its own right.
The above diagram has outputs at each time step (at every step the RNN can make a next-step prediction), but depending on the task this may not be necessary. For example, when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.
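A minimal sketch of the final-output-only case, in the style of sentiment classification. All names and dimensions here are invented for illustration: the recurrence runs over the whole sentence, but only the last hidden state feeds a (hypothetical) classification head:

```python
import numpy as np

def final_state(xs, U, W):
    # Run the recurrence over the whole sequence but keep only the
    # last hidden state, as in a sentence-level task like sentiment.
    s = np.zeros(W.shape[0])
    for x in xs:
        s = np.tanh(U @ x + W @ s)
    return s

rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(4, 8))       # input-to-hidden weights
W = rng.normal(scale=0.1, size=(4, 4))       # hidden-to-hidden weights
V_cls = rng.normal(scale=0.1, size=(2, 4))   # hypothetical 2-class head

sentence = [np.eye(8)[i] for i in [2, 5, 5, 1]]  # four one-hot "words"
s_final = final_state(sentence, U, W)
logits = V_cls @ s_final   # one prediction for the entire sentence
```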

Another property the RNN structure reveals is that RNNs can handle variable-length sequences, which is a real advantage over CNNs in this respect.
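A small sketch of this last point, with illustrative dimensions: the same U and W process sequences of different lengths without any change, because the recurrence simply loops for more or fewer steps:

```python
import numpy as np

def run_rnn(xs, U, W):
    # The same weights handle any sequence length:
    # the loop just runs for as many steps as there are inputs.
    s = np.zeros(W.shape[0])
    for x in xs:
        s = np.tanh(U @ x + W @ s)
    return s

rng = np.random.default_rng(2)
U = rng.normal(scale=0.1, size=(4, 8))
W = rng.normal(scale=0.1, size=(4, 4))

short = [np.eye(8)[i] for i in [0, 1]]                 # 2 steps
longer = [np.eye(8)[i] for i in [0, 1, 2, 3, 4, 5, 6]] # 7 steps
s_short = run_rnn(short, U, W)
s_long = run_rnn(longer, U, W)
# Both sequences yield a hidden state of the same shape, unlike a
# fixed-input-size feedforward network or CNN classifier head.
```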