Notes for Prof. Hung-Yi Lee's ML Lecture: RNN
RNN
Structure & Operation
We can also make deep RNNs with more than one recurrent layer. The simplest RNN, which feeds the hidden-layer values back as part of the next step's input, is called the Elman network. There is also the Jordan network, which feeds the output back into the hidden layer instead. The Jordan network may perform better than the Elman network because the output values are trained against a target, while the hidden-layer values are not.
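A minimal NumPy sketch of the Elman-style recurrence (the dimensions and weight names here are illustrative assumptions, not from the lecture):

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a simple (Elman) RNN: the previous hidden state
    is fed back together with the current input."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# toy dimensions, chosen only for illustration
input_dim, hidden_dim, seq_len = 3, 4, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = elman_step(x_t, h, W_xh, W_hh, b_h)  # the same weights are reused at every step
```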
Bi-Directional RNN
To let the network consider the whole input sequence even when producing the first output, we add a second network that reads the input sequence from the end to the beginning, and combine the two networks' outputs at each time step to produce the final output.
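A small NumPy sketch of this idea (shapes and parameter names are assumptions; the two directions are combined here by concatenation, one common choice):

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

input_dim, hidden_dim, seq_len = 3, 4, 5
rng = np.random.default_rng(0)
params_fwd = [rng.normal(size=(hidden_dim, input_dim)),
              rng.normal(size=(hidden_dim, hidden_dim)),
              np.zeros(hidden_dim)]
params_bwd = [rng.normal(size=(hidden_dim, input_dim)),
              rng.normal(size=(hidden_dim, hidden_dim)),
              np.zeros(hidden_dim)]

xs = rng.normal(size=(seq_len, input_dim))
h_forward = run_rnn(xs, *params_fwd)                 # reads x_1 ... x_T
h_backward = run_rnn(xs[::-1], *params_bwd)[::-1]    # reads x_T ... x_1, then realigned
combined = np.concatenate([h_forward, h_backward], axis=1)  # each step now sees the whole sequence
```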
Long Short-Term Memory (LSTM)
Structure & Operation
When $f(z_f) = 1$, the forget gate is open and the memory $c$ is retained. The forget gate actually works as a "retain gate".
An LSTM neuron has 4 inputs (the cell input plus the input, forget, and output gates), so an LSTM layer has 4 times as many parameters as a normal layer.
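We can sanity-check the factor of 4 with PyTorch's built-in layers (this relies on PyTorch's particular parametrization; the sizes below are arbitrary):

```python
import torch.nn as nn

input_dim, hidden_dim = 64, 128
rnn = nn.RNN(input_dim, hidden_dim)    # simple (Elman-style) recurrent layer
lstm = nn.LSTM(input_dim, hidden_dim)  # same sizes, but 4 gates/inputs per unit

n_rnn = sum(p.numel() for p in rnn.parameters())
n_lstm = sum(p.numel() for p in lstm.parameters())
print(n_rnn, n_lstm, n_lstm / n_rnn)   # ratio is exactly 4 under this parametrization
```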
Training of any RNN is based on back propagation through time.
Furthermore, in the complete LSTM, the previous output $h_{t-1}$ and, through the peephole connections, the previous memory cell value $c_{t-1}$ are fed back as part of the input at the next step.
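A minimal NumPy sketch of one LSTM step (without peepholes; the names and shapes are illustrative assumptions), showing the 4 inputs per unit: the cell input $z$ and the input, forget, and output gates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W holds the 4 weight matrices (cell input z and the
    input/forget/output gates), each acting on [x_t, h_prev]."""
    v = np.concatenate([x_t, h_prev])
    z  = np.tanh(W["z"] @ v + b["z"])    # candidate cell input
    zi = sigmoid(W["i"] @ v + b["i"])    # input gate
    zf = sigmoid(W["f"] @ v + b["f"])    # forget ("retain") gate
    zo = sigmoid(W["o"] @ v + b["o"])    # output gate
    c = zf * c_prev + zi * z             # memory is added to, not overwritten
    h = zo * np.tanh(c)
    return h, c

input_dim, hidden_dim = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden_dim, input_dim + hidden_dim)) for k in "zifo"}
b = {k: np.zeros(hidden_dim) for k in "zifo"}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
```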
Discussion on Terminologies
The hyphen should be between "short" and "term" (i.e., long "short-term" memory), because an LSTM is a kind of short-term memory that lasts relatively longer; the memory may be either retained or forgotten.
Because LSTM is now the standard form of RNN, people may mean "LSTM" when they say "RNN". We call the original architecture the "simple RNN" or "vanilla RNN".
Discussion on RNN Training
Training a vanilla RNN may be difficult; for example, the loss may oscillate during training:
The reason is that the error surface is rough: it is either very flat or very steep.
The sigmoid activation function is not the cause of the rough error surface; in RNNs, ReLU usually performs worse than sigmoid.
The real reason is that the recurrent weights, which carry information from one step to the next, are applied repeatedly, so they affect the final output exponentially in the sequence length; their gradients with respect to the loss can therefore be extremely large or extremely small.
Gradient clipping is a must when training RNNs to avoid getting NaN values.
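For example, in PyTorch, gradient norm clipping is applied between the backward pass and the optimizer step (the model, data, and `max_norm` value below are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(20, 1, 8)        # (seq_len, batch, input_size)
target = torch.randn(20, 1, 16)  # dummy targets, for illustration only

output, _ = model(x)
loss = loss_fn(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their total norm is at most 1.0, preventing exploding updates
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```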
Another solution is to replace the vanilla RNN with an LSTM. Because the memory and the new input are added rather than overwritten, the influence of an input never disappears unless the forget gate is closed, which is why LSTM overcomes gradient vanishing (though not gradient explosion). Without gradient vanishing there are no extremely flat regions in the error surface, so we can safely use a small learning rate. In fact, the motivation of the earliest LSTM (which had no forget gate) was to handle gradient vanishing.
Alternatively, we can use a smaller variant of LSTM: the Gated Recurrent Unit (GRU) [Cho, EMNLP'14]. GRU has fewer parameters, so it is less prone to overfitting. In a GRU, the forget and input gates are coupled: the memory is forgotten only when new input is written in (when the input gate opens, the forget gate closes, and vice versa). A sketch follows below.
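A minimal NumPy sketch of one GRU step (standard formulation; names and shapes are illustrative), showing how a single update gate $z$ couples "forget" ($1-z$) and "input" ($z$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, b):
    """One GRU step: the update gate z plays both roles, keeping (1 - z) of
    the old state and writing z of the new candidate."""
    v = np.concatenate([x_t, h_prev])
    z = sigmoid(W["z"] @ v + b["z"])    # update gate
    r = sigmoid(W["r"] @ v + b["r"])    # reset gate
    h_cand = np.tanh(W["h"] @ np.concatenate([x_t, r * h_prev]) + b["h"])
    return (1.0 - z) * h_prev + z * h_cand   # coupled forget/input

input_dim, hidden_dim = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden_dim, input_dim + hidden_dim)) for k in "zrh"}
b = {k: np.zeros(hidden_dim) for k in "zrh"}

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W, b)
```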
Other solutions
Applications
Many to Many with the Same Length
Slot Filling
Many to One
Input is a vector sequence, but output is only one vector.
Many to Many (Output is shorter)
Speech Recognition
Many to Many (No Limitation)
Auto-Encoder
The auto-encoder can be used for search.
Besides, a seq2seq auto-encoder tends to capture syntactic (grammar) information, while skip-thought tends to capture semantic information.
Attention Based Model
Concepts
Reading Comprehension
ref: End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015. Keras example
Visual QA
Speech Question Answering
A future direction may be the combination of structured learning and DNN.
We can also see GAN as a kind of structured learning.
RNN vs Structured Learning
References
Youtube ML Lecture 21-1: Recurrent Neural Network (Part I)