An exploration of recent developments of Recurrent Units in Recurrent Neural Networks (RNN) and their effect on contextual understanding in text.
User types input sequence.
Recurrent neural network processes the sequence.
The output for the last character is used.
The most likely suggestions are extracted.
The indices are looked up in a dictionary.
Autocomplete: An example application, showing how a simple recurrent neural network can be used for autocompletion. The network uses past information and understands that the next word should be a country.
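The last three steps listed above (use the output for the last character, extract the most likely suggestions, look the indices up in a dictionary) can be sketched as follows. This is only an illustration: the vocabulary `index_to_word` and the scores in `outputs` are made-up values, not output from the actual model, and `suggest` is a hypothetical helper name.

```python
def suggest(last_output, index_to_word, k=3):
    """Return the k most likely words for the output at the last character."""
    # Rank vocabulary indices by score, highest first.
    ranked = sorted(range(len(last_output)),
                    key=lambda i: last_output[i], reverse=True)
    # Look each of the top-k indices up in the dictionary.
    return [index_to_word[i] for i in ranked[:k]]

# Hypothetical per-character network outputs, one score per vocabulary index.
index_to_word = ["the", "denmark", "germany", "of", "france"]
outputs = [
    [0.9, 0.0, 0.0, 0.1, 0.0],   # output after an earlier character
    [0.1, 2.0, 1.5, -1.0, 1.0],  # output after the last typed character
]

# Only the output for the last character is used.
suggestions = suggest(outputs[-1], index_to_word)
print(suggestions)
```

Ranking raw scores gives the same order as ranking probabilities, so no normalization is needed for this step.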
Recent advances in handwriting recognition, speech recognition, and machine translation have, with only a few exceptions, been based on recurrent neural networks.
Recurrent neural networks are, funnily enough, a type of neural network. Neural networks have been around since at least 1975 but have made a comeback in recent years and become very popular. This is likely due to advances in General Purpose GPU (GPGPU) programming, which provide the computational resources to train them, and to larger datasets that provide enough data to train large networks.
If you are not familiar with neural networks, it is recommended that you become at least a bit familiar. Today there are many sources to learn from. The Neural Networks and Deep Learning book by Michael Nielsen is quite easy to get started with; chapter 2 should give most of the required background. If you are more curious, the Deep Learning book by Goodfellow et al. is much more extensive; chapter 5 should be a good start.
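To give an all too brief introduction, vanilla neural networks are essentially composed of two things: sums and a non-linear function, like the sigmoid function. In matrix notation this can be written as:

$$h^{\ell} = \theta\left(W^{\ell} h^{\ell-1} + b^{\ell}\right), \quad \forall \ell \in [1, L]$$

where $\theta$ is the non-linear function, $W^{\ell}$ and $b^{\ell}$ are the weights and biases of layer $\ell$, $h^{0}$ is the input, and $h^{L}$ is the output.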
In this article, the output is in terms of probabilities. To turn something into probabilities the Softmax function can be used.
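As a rough illustration of the two ingredients (weighted sums followed by a non-linearity) and of how the softmax function turns the final layer into probabilities, here is a minimal sketch in plain Python. The weights, biases and input are arbitrary made-up numbers, and the helper names `dense` and `softmax` are chosen for this example only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(W, b, h, activation):
    """One layer: a weighted sum per output unit, followed by a non-linearity."""
    return [activation(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)]

def softmax(z):
    """Exponentiate and normalize so the values are positive and sum to one."""
    m = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input and parameters for a tiny two-layer network.
x = [1.0, -2.0]
W1 = [[0.5, -0.25], [1.0, 1.0], [0.0, 0.5]]
b1 = [0.0, 0.5, -0.5]
W2 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
b2 = [0.0, 0.0]

h1 = dense(W1, b1, x, sigmoid)           # hidden layer
logits = dense(W2, b2, h1, lambda v: v)  # final layer, no non-linearity
probs = softmax(logits)                  # probabilities over two classes
print(probs)
```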
The examples mentioned earlier may use additional techniques such as attention mechanisms to work with an unknown alignment between the source and the target sequence.
However, the foundation for these networks is still the recurrent neural network. Likewise, a common challenge for many of these applications is to get the network to memorize past content from the input sequences and use this for contextual understanding later in the sequence.
This memorization problem is what is explored in this article. To this end, this article doesn’t go into the details of how to deal with an unknown alignment but rather focuses on problems where the alignment is known and explores the memorization issue for those problems. This is heavily inspired by the recent article on Nested LSTMs, which are also discussed in this article.
Recurrent neural networks (RNNs) are well known and thoroughly explained in the literature. To keep it short, recurrent neural networks let you model a sequence of vectors. RNNs do this by iterating over the sequence, where each layer uses the output from the same layer in the previous “time” iteration, combined with the output from the previous layer in the same “time” iteration.
In theory, this allows the network in each iteration to know about every part of the sequence that came before.
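Given an input sequence $\{x_1, x_2, \ldots, x_T\}$, such a model can be expressed using the following set of equations:

$$h_t^{\ell} = \mathrm{Unit}\left(h_{t-1}^{\ell}, h_t^{\ell-1}\right), \quad \forall \ell \in [1, L],\ \forall t \in [1, T]$$

where $h_t^{0} = x_t$ is the input at iteration $t$. Note that how the output from the previous iteration ($h_{t-1}^{\ell}$) and the output from the previous layer in the same iteration ($h_t^{\ell-1}$) are combined is abstracted away in $\mathrm{Unit}$.
For a vanilla recurrent neural network, the recurrent unit is:

$$\mathrm{Unit}\left(h_{t-1}^{\ell}, h_t^{\ell-1}\right) = \theta\left(W_{hh}^{\ell} h_{t-1}^{\ell} + W_{xh}^{\ell} h_t^{\ell-1} + b_h^{\ell}\right)$$

where $\theta$ is a non-linear activation function.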
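A minimal sketch of a stacked vanilla recurrent network in plain Python may help make the two directions of information flow concrete. It assumes `tanh` as the non-linearity and zero initial states; the weights are made-up values and the helper names (`matvec`, `vanilla_unit`, `run_rnn`) are hypothetical.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vanilla_unit(h_prev, below, W_h, W_x, b):
    """Combine the same layer's previous state with the layer below."""
    rec = matvec(W_h, h_prev)  # contribution from the previous iteration
    inp = matvec(W_x, below)   # contribution from the previous layer
    return [math.tanh(r + i + bias) for r, i, bias in zip(rec, inp, b)]

def run_rnn(xs, layers, hidden_size):
    """layers is a list of (W_h, W_x, b); returns the top-layer output per step."""
    states = [[0.0] * hidden_size for _ in layers]  # assumed zero initial state
    outputs = []
    for x in xs:                                    # iterate over "time"
        below = x
        for l, (W_h, W_x, b) in enumerate(layers):  # iterate over layers
            states[l] = vanilla_unit(states[l], below, W_h, W_x, b)
            below = states[l]
        outputs.append(below)
    return outputs

# Hypothetical parameters: two layers, two units each, two-dimensional input.
layer = ([[0.5, 0.0], [0.0, 0.5]], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
layers = [layer, layer]

out_a = run_rnn([[1.0, 0.0], [0.0, 1.0]], layers, hidden_size=2)
out_b = run_rnn([[0.0, 1.0], [0.0, 1.0]], layers, hidden_size=2)
# The second input is identical in both runs, yet the outputs differ,
# because the state carries information about the first input.
print(out_a[1], out_b[1])
```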
Vanishing Gradient Problem
Deep neural networks can suffer from a vanishing gradient problem, where the gradient used in optimization becomes minuscule. This is because the error term $\delta^{\ell}$ used in backpropagation ends up depending multiplicatively on the $\delta^{\ell+1}$ of the next layer. This problem can be mitigated through careful initialization of the weights, by choosing an activation function such as the Rectified Linear Unit (ReLU), or by adding residual connections.
In classic recurrent neural networks, this problem becomes much worse, because the time dependencies essentially unfold into a potentially infinitely deep neural network.
An intuitive way of viewing this problem is that the vanilla recurrent network forces an update of the hidden state in every iteration. This forced update is what causes the vanishing gradient problem.
This forced update is also problematic because irrelevant input data, such as stop words, blurs out important information from previous iterations.
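The multiplicative shrinking can be seen numerically in a toy case. The sketch below assumes a single-unit vanilla recurrent network, $h_t = \tanh(w\,h_{t-1} + x_t)$, with an arbitrary weight `w = 0.9` and a constant made-up input; it applies the chain rule to compute how much the final state depends on the initial state.

```python
import math

def gradient_through_time(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2), and |.| <= |w|.
        grad *= w * (1.0 - h * h)
    return grad

xs = [0.5] * 50  # hypothetical constant input sequence
g1 = gradient_through_time(0.9, xs[:1])
g10 = gradient_through_time(0.9, xs[:10])
g50 = gradient_through_time(0.9, xs)
print(g1, g10, g50)
```

Each time step multiplies the gradient by a factor smaller than one in magnitude, so the dependence on the initial state shrinks geometrically with the sequence length.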
Long Short-Term Memory
The Long Short-Term Memory (LSTM) unit replaces the simple unit from earlier. Each LSTM unit contains a single memory scalar that can be protected or written to, depending on the input and forget gate. This structure has been shown to be very powerful in solving complex sequential problems. LSTM is well known and thoroughly explained in the literature and is therefore not discussed here.
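However, as it plays a critical part in the Nested LSTM unit, which is discussed later, its equations are mentioned here. With the layer index omitted, $x_t$ denoting the input to the unit, $h_{t-1}$ its previous output, and $c_t$ its memory, they are:

$$
\begin{aligned}
i_t &= \sigma_i\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) \\
f_t &= \sigma_f\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) \\
o_t &= \sigma_o\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \sigma_c\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
h_t &= o_t \odot \sigma_h\left(c_t\right)
\end{aligned}
$$

The gate activation functions ($\sigma_i$, $\sigma_f$, $\sigma_o$) are usually the sigmoid activation function, while $\sigma_c$ and $\sigma_h$ are usually $\tanh$.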
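To illustrate how the memory can be protected or written to depending on the input and forget gates, here is a minimal sketch of one standard LSTM step in plain Python, assuming sigmoid gates and `tanh` elsewhere. The parameter layout (a dict of `(W_x, W_h, b)` tuples) and the numbers in `make_params` are assumptions made for this example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(params, x, h):
    """Compute W_x x + W_h h + b for params = (W_x, W_h, b)."""
    W_x, W_h, b = params
    return [sum(w * v for w, v in zip(row_x, x)) +
            sum(w * v for w, v in zip(row_h, h)) + bias
            for row_x, row_h, bias in zip(W_x, W_h, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps 'i', 'f', 'o', 'c' to (W_x, W_h, b)."""
    i = [sigmoid(v) for v in affine(p["i"], x, h_prev)]    # input gate
    f = [sigmoid(v) for v in affine(p["f"], x, h_prev)]    # forget gate
    o = [sigmoid(v) for v in affine(p["o"], x, h_prev)]    # output gate
    g = [math.tanh(v) for v in affine(p["c"], x, h_prev)]  # candidate value
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def make_params(b_i, b_f):
    """Hypothetical parameters for a single unit with a one-dimensional input."""
    return {"i": ([[1.0]], [[0.0]], [b_i]), "f": ([[0.0]], [[0.0]], [b_f]),
            "o": ([[0.0]], [[0.0]], [0.0]), "c": ([[1.0]], [[0.0]], [0.0])}

protect = make_params(b_i=-10.0, b_f=10.0)  # input gate closed, forget gate open
write = make_params(b_i=10.0, b_f=-10.0)    # input gate open, forget gate closed

_, c_protected = lstm_step([1.0], [0.0], [0.7], protect)
_, c_written = lstm_step([1.0], [0.0], [0.7], write)
print(c_protected, c_written)
```

With the input gate closed and the forget gate open, the memory stays close to its previous value of 0.7; with the gates reversed, it is overwritten by the candidate value.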
Even though the LSTM unit and the GRU solve the vanishing gradient problem on a theoretical level, long-term memorization continues to be a challenge in recurrent neural networks.
There are alternatives to LSTM, the most popular being the Gated Recurrent Unit (GRU). However, the GRU doesn't necessarily give better long-term context, particularly as it solves the vanishing gradient problem without using any internal memory.
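The article does not list the GRU equations, so the following is only a sketch of one common formulation, written for a single scalar unit. Note that conventions differ on whether the update gate `z` weights the old state or the candidate; the choice below is an assumption, as are the parameter values.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p maps 'z', 'r', 'h' to (w_x, w_h, b)."""
    w_x, w_h, b = p["z"]
    z = sigmoid(w_x * x + w_h * h_prev + b)  # update gate
    w_x, w_h, b = p["r"]
    r = sigmoid(w_x * x + w_h * h_prev + b)  # reset gate
    w_x, w_h, b = p["h"]
    candidate = math.tanh(w_x * x + w_h * (r * h_prev) + b)
    # The only state is h itself: there is no separate memory cell.
    return (1.0 - z) * h_prev + z * candidate

# Hypothetical parameters: a strongly negative update-gate bias keeps the state.
keep = {"z": (0.0, 0.0, -20.0), "r": (1.0, 1.0, 0.0), "h": (1.0, 1.0, 0.0)}
h_kept = gru_step(1.0, 0.5, keep)
print(h_kept)
```

The state can be carried through unchanged when the update gate is closed, which is how the unit avoids the forced update without a dedicated memory cell.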
The Nested LSTM unit attempts to solve the long-term memorization problem from a more practical point of view. Where the classic LSTM unit solves the vanishing gradient problem by adding internal memory, and the GRU attempts to be a faster solution than LSTM by using no internal memory, the Nested LSTM goes in the opposite direction of the GRU, as it adds additional memory to the unit.
The idea here is that adding additional memory to the unit allows for more long-term memorization.
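The additional memory is integrated by changing how the cell value $c_t$ is updated. Instead of defining the cell value update as $c_t = f_t \odot c_{t-1} + i_t \odot \sigma_c\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right)$, it uses another LSTM unit:

$$c_t = \tilde{h}_t = \mathrm{LSTM}\left(\tilde{x}_t, \tilde{h}_{t-1}\right), \quad \tilde{x}_t = i_t \odot \sigma_c\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right), \quad \tilde{h}_{t-1} = f_t \odot c_{t-1}$$

Note that the variables defined in $\mathrm{LSTM}(\cdot, \cdot)$ are different from those of the outer unit; below they are marked with a tilde. The end result is that a unit has two memory states, $c_t$ and $\tilde{c}_t$.
The complete set of equations then becomes:

$$
\begin{aligned}
i_t &= \sigma_i\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) \\
f_t &= \sigma_f\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) \\
o_t &= \sigma_o\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) \\
\tilde{x}_t &= i_t \odot \sigma_c\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
\tilde{h}_{t-1} &= f_t \odot c_{t-1} \\
\tilde{i}_t &= \tilde{\sigma}_i\left(\tilde{W}_{xi} \tilde{x}_t + \tilde{W}_{hi} \tilde{h}_{t-1} + \tilde{b}_i\right) \\
\tilde{f}_t &= \tilde{\sigma}_f\left(\tilde{W}_{xf} \tilde{x}_t + \tilde{W}_{hf} \tilde{h}_{t-1} + \tilde{b}_f\right) \\
\tilde{o}_t &= \tilde{\sigma}_o\left(\tilde{W}_{xo} \tilde{x}_t + \tilde{W}_{ho} \tilde{h}_{t-1} + \tilde{b}_o\right) \\
\tilde{c}_t &= \tilde{f}_t \odot \tilde{c}_{t-1} + \tilde{i}_t \odot \tilde{\sigma}_c\left(\tilde{W}_{xc} \tilde{x}_t + \tilde{W}_{hc} \tilde{h}_{t-1} + \tilde{b}_c\right) \\
\tilde{h}_t &= \tilde{o}_t \odot \tilde{\sigma}_h\left(\tilde{c}_t\right) \\
c_t &= \tilde{h}_t \\
h_t &= o_t \odot \sigma_h\left(c_t\right)
\end{aligned}
$$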
Like in vanilla LSTM, the gate activation functions are usually the sigmoid activation function. However, of the two remaining activation functions of the outer unit, only one is set to $\tanh$, while the other is just the identity function; otherwise two non-linear activation functions would be applied to the same scalar without any change in between, except for the multiplication by the input gate. The activation functions of the inner LSTM unit remain the same.
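A minimal sketch of one Nested LSTM step for a single scalar unit shows how the inner LSTM replaces the usual cell update. The two outer activation functions are passed in explicitly as `sigma_c` and `sigma_h`; the pairing used in the example call (identity and `tanh`) is only one possible choice and should be treated as an assumption, as should all parameter values and helper names.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gate(p, x, h, act):
    """act(w_x * x + w_h * h + b) for scalar parameters p = (w_x, w_h, b)."""
    w_x, w_h, b = p
    return act(w_x * x + w_h * h + b)

def lstm_step(x, h_prev, c_prev, p):
    """Standard scalar LSTM step with tanh activations; returns (h, c)."""
    i = gate(p["i"], x, h_prev, sigmoid)
    f = gate(p["f"], x, h_prev, sigmoid)
    o = gate(p["o"], x, h_prev, sigmoid)
    c = f * c_prev + i * gate(p["c"], x, h_prev, math.tanh)
    return o * math.tanh(c), c

def nested_lstm_step(x, h_prev, c_prev, c_inner_prev, outer, inner,
                     sigma_c, sigma_h):
    """One Nested LSTM step; returns (h, c, c_inner)."""
    i = gate(outer["i"], x, h_prev, sigmoid)
    f = gate(outer["f"], x, h_prev, sigmoid)
    o = gate(outer["o"], x, h_prev, sigmoid)
    x_inner = i * gate(outer["c"], x, h_prev, sigma_c)  # input to inner unit
    h_inner_prev = f * c_prev                           # inner "previous output"
    # The outer cell value is the output of the inner LSTM unit.
    c, c_inner = lstm_step(x_inner, h_inner_prev, c_inner_prev, inner)
    return o * sigma_h(c), c, c_inner

# Hypothetical parameters, shared by the outer and inner unit for brevity.
params = {"i": (1.0, 0.5, 0.0), "f": (0.5, 0.5, 1.0),
          "o": (1.0, 0.0, 0.0), "c": (1.0, 1.0, 0.0)}

h, c, c_inner = 0.0, 0.0, 0.0
history = []
for x in [1.0, -0.5, 0.25]:
    h, c, c_inner = nested_lstm_step(x, h, c, c_inner, params, params,
                                     sigma_c=lambda v: v, sigma_h=math.tanh)
    history.append((h, c, c_inner))
print(history)
```

The unit carries two memory values between iterations, `c` and `c_inner`, in addition to its output `h`.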
The abstraction of how to combine the input with the cell value allows a lot of flexibility. Using this abstraction, it is not only possible to add one extra internal memory state; the internal unit can also recursively be replaced by as many internal units as one wishes, thereby adding even more internal memory.
From a theoretical view, whether or not the Nested LSTM unit improves long context is not really clear. The LSTM unit theoretically solves the vanishing gradient problem and a network of LSTM units is Turing complete. In theory, an LSTM unit should be sufficient for solving problems that require long-term memorization.
That being said, it is often very difficult to train LSTM and GRU based recurrent neural networks. These difficulties often come down to the curvature of the loss function and it is possible that the Nested LSTM improves this curvature and therefore is easier to optimize.
Comparing Recurrent Units
Comparing the different recurrent units is not a trivial task. Different problems require different contextual understanding and therefore require different memorization.
A good problem for analyzing contextual understanding should have a human-interpretable output and depend on both long-term and short-term memorization.
To this end, the autocomplete problem is used. Each character is mapped to a target that represents the entire word. To make it extra difficult, the space leading up to the word should also map to that word. The text is from the full text8 dataset, where each observation consists of at most 200 characters and is ensured not to contain partial words. 90% of the observations are used for training, 5% for validation and 5% for testing.
The input vocabulary is a-z, space, and a padding symbol. The output vocabulary consists of the most frequent words, and two additional symbols, one for padding and one for unknown words. The network is not penalized for predicting padding and unknown words wrong.
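The encoding described above can be sketched as follows. The index layout (0 for padding, 1 for unknown words, characters starting at 1) and the tiny word list are assumptions for illustration; the actual preprocessing may differ.

```python
PADDING, UNKNOWN = 0, 1  # assumed indices for the two special output symbols

def encode(text, word_to_index):
    """Map each character, and the space leading up to a word, to that word."""
    chars = "abcdefghijklmnopqrstuvwxyz "
    char_to_index = {ch: i + 1 for i, ch in enumerate(chars)}  # 0 is padding
    inputs, targets = [], []
    for n, word in enumerate(text.split(" ")):
        target = word_to_index.get(word, UNKNOWN)
        # Every word except the first is preceded by a space with the same target.
        piece = word if n == 0 else " " + word
        for ch in piece:
            inputs.append(char_to_index[ch])
            targets.append(target)
    return inputs, targets

def loss_mask(targets):
    """1 where the prediction counts towards the loss, 0 for padding/unknown."""
    return [0 if t in (PADDING, UNKNOWN) else 1 for t in targets]

# Hypothetical vocabulary of frequent words.
word_to_index = {"the": 2, "cat": 3}
inputs, targets = encode("the cat sat", word_to_index)
mask = loss_mask(targets)
print(inputs)
print(targets)
print(mask)
```

Here "sat" is not in the vocabulary, so its characters map to the unknown symbol and are masked out of the loss.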
The GRU and LSTM models each have 2 layers of 600 units. Similarly, the Nested LSTM model has 1 layer of 600 units but with 2 internal memory states.
Additionally, each model has an input embedding layer and a final dense layer to match the vocabulary size.
Model Configurations: shows the number of layers, units and parameters for each model.
There are 508583 sequences in the training dataset and a batch size of 64 observations is used. A single pass over the entire dataset (one epoch) then corresponds to 7946 mini-batches, which is enough to train the network; the models are therefore only trained for 7946 mini-batches. For training, Adam optimization is used with default parameters.
Model testing: shows the testing loss and accuracy for the GRU, LSTM, and Nested LSTM models on the autocomplete problem.
As seen from the results, the models are more or less equally fast.
Surprisingly, the Nested LSTM is not better than the LSTM or GRU models.
This somewhat contradicts the results found in the Nested LSTM paper, although they tested the model on different problems and the results are therefore not exactly comparable.
Nevertheless, one would still expect the Nested LSTM model to perform better for this problem, where long-term memorization is important for the contextual understanding.
An unexpected result is that the Nested LSTM model initially converges much faster than the LSTM and GRU models. This, combined with the worse performance, indicates that the Nested LSTM optimizes towards a suboptimal local minimum.
The Nested LSTM model did not provide any benefits over the LSTM or GRU models. This indicates, at least for the autocomplete example, that there isn’t a connection between the number of internal memory states and the model's ability to memorize and use that memory for contextual understanding.
Many thanks to the authors of the original Nested LSTM paper, Joel Ruben Antony Moniz and David Krueger. Even though our findings weren’t the same, they have inspired much of this article and shown that something as widely used as the recurrent unit is still an open research area.
Deep Speech: Scaling up end-to-end speech recognition. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A. and Ng, A.Y., 2014. arXiv preprint arXiv:1412.5567.
Sequence to Sequence Learning with Neural Networks. Sutskever, I., Vinyals, O. and Le, Q.V., 2014. arXiv preprint arXiv:1409.3215.
Attention Is All You Need. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I., 2017. arXiv preprint arXiv:1706.03762.
Neural Machine Translation in Linear Time. Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A. and Kavukcuoglu, K., 2017. arXiv preprint arXiv:1610.10099.
Neural Machine Translation by Jointly Learning to Align and Translate. Bahdanau, D., Cho, K. and Bengio, Y., 2014. arXiv preprint arXiv:1409.0473.
Deep Residual Learning for Image Recognition. He, K., Zhang, X., Ren, S. and Sun, J., 2015. arXiv preprint arXiv:1512.03385.
Supervised Sequence Labelling with Recurrent Neural Networks. Graves, A., 2008.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Cho, K., Merrienboer, B.v., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y., 2014. arXiv preprint arXiv:1406.1078.
Nested LSTMs. Moniz, J.R.A. and Krueger, D., 2018. arXiv preprint arXiv:1801.10308.