**An exploration of recent developments of Recurrent Units in Recurrent Neural
Networks (RNN) and their effect on contextual understanding in text.**

## Introduction

Recent advances in handwriting recognition, speech recognition [1], and machine translation [2] have, with only a few exceptions [3] [4], been based on recurrent neural networks.

### Neural Networks

Recurrent neural networks are, funnily enough, a type of neural network. Neural networks have been around since at least 1975 but have made a comeback in recent years and become very popular. This is likely due to advances in General Purpose GPU (GPGPU) programming, which provide the computational resources to train them, and to larger datasets, which provide enough data to train large networks.

If you are not familiar with neural networks, it is recommended that you become at least somewhat familiar with them first. Today there are many sources to learn from. The Neural Networks and Deep Learning book by Michael Nielsen is quite easy to get started with; chapter 2 should give most of the required background. If you are more curious, the Deep Learning book by Goodfellow et al. is much more extensive; chapter 5 should be a good start.

To give a very brief introduction, vanilla neural networks are
essentially composed of two things: sums and non-linear functions, like
the sigmoid
function. In matrix notation this can be written as:

In this article the output is expressed as probabilities. To turn
arbitrary values into probabilities, the
softmax function
can be used.
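As a rough illustration of the two ingredients (sums and a non-linear function) followed by a softmax output, here is a minimal sketch in plain Python. The helper names `dense` and `softmax`, the layer sizes, and all weight values are hypothetical and chosen only for this example.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, bias, inputs):
    """Weighted sums: one output per row of `weights`, plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy network: 2 inputs -> 2 hidden units -> 3 output classes.
inputs = [1.0, 2.0]
hidden = [sigmoid(z) for z in dense([[0.5, -1.0], [1.5, 0.25]], [0.0, -0.5], inputs)]
probabilities = softmax(dense([[1.0, -1.0], [0.0, 2.0], [-1.0, 0.5]],
                              [0.0, 0.0, 0.0], hidden))
print(probabilities)
```

The hidden layer applies the sum and then the sigmoid; the output layer applies the sum and then the softmax, so the three printed values are positive and add up to one.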

### Memorization problem

The examples mentioned earlier may use additional techniques such as attention mechanisms [5] to work with an unknown alignment between the source and the target sequence.

However, the foundation for these networks is still the recurrent neural network. Likewise, a common challenge for many of these applications is to get the network to memorize past content from the input sequences and use this for contextual understanding later in the sequence.

This memorization problem is what is explored in this article. To this end, this article doesn't go into the details of how to deal with an unknown alignment but rather focuses on problems where the alignment is known and explores the memorization issue for those problems. This is heavily inspired by the recent article on Nested LSTMs [6], which are also discussed in this article.

## Recurrent Units

Recurrent neural networks (RNNs) are well known and thoroughly explained in the literature. To keep it short, recurrent neural networks let you model a sequence of vectors. RNNs do this by iterating over the sequence, where each layer uses the output from the same layer in the previous "time" iteration, combined with the output from the previous layer in the same "time" iteration.

In theory, this allows the network, in each iteration, to know about every part of the sequence that came before.
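The iteration described above can be sketched for a single layer as follows. This is a minimal illustration, not the article's implementation: the `tanh` activation, the function names `rnn_step` and `run_rnn`, and the toy weights are all assumptions.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec, bias):
    """Combine the current input with the layer's own previous output."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[j], x))           # from the layer below
            + sum(u * hi for u, hi in zip(w_rec[j], h_prev))   # from the previous time step
            + bias[j]
        )
        for j in range(len(bias))
    ]

def run_rnn(sequence, w_in, w_rec, bias):
    """Iterate over the sequence, carrying the state from step to step."""
    h = [0.0] * len(bias)  # initial state
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states

# Hypothetical toy weights: 2 input features, 2 recurrent units.
w_in = [[0.5, -0.3], [0.1, 0.8]]
w_rec = [[0.2, 0.0], [-0.4, 0.6]]
bias = [0.0, 0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = run_rnn(sequence, w_in, w_rec, bias)
print(states[-1])
```

Because each state is computed from the previous one, changing the first element of `sequence` changes the last state as well, which is the sense in which the network can know about everything that came before.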

Given an input sequence

For a vanilla recurrent neural network, the recurrent unit

## Vanishing Gradient Problem

Deep neural networks can suffer from a vanishing gradient problem where
the gradient used in optimization becomes minuscule. This is because the

In classic recurrent neural networks, this problem becomes much worse,
because the time dependencies essentially unfold
into a potentially infinitely deep neural network.
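To make the effect concrete, here is a small sketch under simplifying assumptions: a scalar recurrent state, a sigmoid activation, and a fixed recurrent weight. These choices, and the function name `gradient_through_time`, are hypothetical and only meant to show how the repeated multiplication behaves.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its largest value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_through_time(weight, pre_activations):
    """Chain rule across time steps for a scalar recurrence.

    Each step contributes one factor: weight * sigmoid'(z_t).
    """
    grad = 1.0
    for z in pre_activations:
        grad *= weight * sigmoid_grad(z)
    return grad

# Even in the best case (z = 0 at every step, weight = 1) the gradient
# shrinks by a factor of 4 per time step.
for steps in (1, 10, 50):
    print(steps, gradient_through_time(1.0, [0.0] * steps))
```

With these assumptions the gradient after `n` steps is `0.25 ** n`, so a dependency 50 steps back contributes a gradient that is far too small to influence optimization in practice.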

An intuitive way of viewing this problem is that the vanilla recurrent
network forces an update of the state

## Long Short-Term Memory

The Long Short-Term Memory (LSTM) unit replaces the simple

The gate activation functions
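For reference, the standard LSTM formulation from the literature can be written as below. The symbols are assumptions chosen for this sketch and may not match the notation used elsewhere in this article: $x_t$ is the input, $h_t$ the output, $c_t$ the cell value, $i_t$, $f_t$, $o_t$ the input, forget, and output gates, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of $c_t$ is the part that matters for the vanishing gradient problem: when $f_t$ is close to 1, the cell value is carried forward largely unchanged rather than being overwritten at every step.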

## Nested LSTM

Even though the LSTM unit and the GRU solve the vanishing gradient problem on a theoretical level, long-term memorization continues to be a challenge in recurrent neural networks.

There are alternatives to the LSTM, the most popular being the Gated Recurrent Unit (GRU). However, the GRU doesn’t necessarily give better long-term context, particularly as it solves the vanishing gradient problem without using any internal memory.

The Nested LSTM unit attempts to solve the long-term memorization problem from a more practical point of view. Where the classic LSTM unit solves the vanishing gradient problem by adding internal memory, and the GRU attempts to be a faster solution than the LSTM by using no internal memory, the Nested LSTM goes in the opposite direction of the GRU, as it adds additional memory to the unit [6].

The idea here is that adding additional memory to the unit allows for more long-term memorization.

The additional memory is integrated by changing how the cell value

The complete set of equations then becomes:

Like in vanilla LSTM, the gate activation functions

The abstraction of how to combine the input with the cell value allows
a lot of flexibility. Using this abstraction, it is not only possible
to add one extra internal memory state but the internal

From a theoretical view, whether or not the Nested LSTM unit improves long context is not really clear. The LSTM unit theoretically solves the vanishing gradient problem and a network of LSTM units is Turing complete. In theory, an LSTM unit should be sufficient for solving problems that require long-term memorization.

That being said, it is often very difficult to train LSTM and GRU based recurrent neural networks. These difficulties often come down to the curvature of the loss function and it is possible that the Nested LSTM improves this curvature and therefore is easier to optimize.
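The nesting idea discussed in this section can be sketched with a single-unit (scalar) recursion. This is a simplified illustration and not the reference implementation from [6]: the inner unit's input and previous output are taken to be the gated candidate and the gated old cell value, every level uses `tanh` for the candidate and the output, and the helper `constant_params` and all weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nested_lstm_step(x, h_prev, cells, params):
    """One step of a scalar LSTM with len(cells) memory levels.

    cells[0] is the outer cell value, cells[1:] belong to the inner units.
    With a single level this is a plain LSTM step.
    """
    p = params[0]

    def pre(name):
        return p["w" + name] * x + p["u" + name] * h_prev + p["b" + name]

    i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    candidate = math.tanh(pre("c"))
    if len(cells) == 1:
        # Plain LSTM: keep part of the old cell value, add part of the candidate.
        c = f * cells[0] + i * candidate
        inner_cells = []
    else:
        # Nested LSTM: an inner unit replaces the addition. Its input is the
        # gated candidate and its "previous output" is the gated old cell value.
        c, inner_cells = nested_lstm_step(i * candidate, f * cells[0],
                                          cells[1:], params[1:])
    h = o * math.tanh(c)
    return h, [c] + inner_cells

def constant_params(value):
    """Hypothetical helper: every weight and bias set to the same value."""
    return {prefix + name: value for prefix in "wub" for name in "ifoc"}

params = [constant_params(0.5), constant_params(-0.3)]  # one dict per level
h, cells = 0.0, [0.0, 0.0]                              # 2 memory levels
for x in [1.0, -1.0, 0.5]:
    h, cells = nested_lstm_step(x, h, cells, params)
print(h, cells)
```

Because the function calls itself for the inner unit, passing a longer `cells` list (with one parameter dictionary per level) gives deeper nesting without any other change.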

## Comparing Recurrent Units

Comparing the different Recurrent Units is not a trivial task. Different problems require different contextual understanding and therefore different memorization.

A good problem for analyzing contextual understanding should have a human-interpretable output and depend on both long-term and short-term memorization.

To this end, the autocomplete problem is used. Each character is mapped to a target that represents the entire word. To make it extra difficult, the space leading up to the word should also map to that word. The text is from the full text8 dataset, where each observation consists of at most 200 characters and is guaranteed not to contain partial words. 90% of the observations are used for training, 5% for validation, and 5% for testing.
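The character-to-word mapping described above can be sketched as follows. The function name `autocomplete_targets` is hypothetical, and the sketch assumes words are separated by single spaces with no leading or trailing space.

```python
def autocomplete_targets(text):
    """Pair every character with the word it should predict.

    Characters inside a word map to that word; the space leading up to a
    word maps to the word that follows it.
    """
    pairs = []
    for index, word in enumerate(text.split(" ")):
        if index > 0:
            pairs.append((" ", word))  # the space before the word
        pairs.extend((char, word) for char in word)
    return pairs

for char, target in autocomplete_targets("the cat"):
    print(repr(char), "->", target)
```

For the input `"the cat"`, the three letters of the first word map to `the`, while the space and the letters `c`, `a`, `t` all map to `cat`. Predicting the target for the space is the hard part, since the model has to guess the next word from the preceding context alone.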

The input vocabulary is a-z, space, and a padding symbol. The output
vocabulary consists of the

The GRU and LSTM models each have 2 layers of 600 units. Similarly, the Nested LSTM model has 1 layer of 600 units but with 2 internal memory states. Additionally, each model has an input embedding layer and a final dense layer to match the vocabulary size.

| Model | Units | Layers | Depth | Embedding parameters | Recurrent parameters | Dense parameters |
|---|---|---|---|---|---|---|
| GRU | 600 | 2 | N/A | 16200 | 4323600 | 9847986 |
| LSTM | 600 | 2 | N/A | 16200 | 5764800 | 9847986 |
| Nested LSTM | 600 | 1 | 2 | 16200 | 5764800 | 9847986 |

**Model Configurations:** shows the number of layers, units, and parameters for each model.
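The parameter counts in the table can be reproduced with a few lines of arithmetic. The sketch below assumes one input matrix, one recurrent matrix, and one bias vector per gate block (3 blocks for the GRU, 4 for the LSTM). The vocabulary sizes are not taken from the text: 27 input symbols and 16386 output words are inferred by dividing the table's embedding and dense counts by 600 and 601 respectively, and 27 is one fewer than the 28 input symbols listed earlier.

```python
UNITS = 600
INPUT_SYMBOLS = 27    # assumption: inferred from 16200 / 600
OUTPUT_WORDS = 16386  # assumption: inferred from 9847986 / (600 + 1)

def embedding_params(vocabulary, units):
    """One vector of size `units` per input symbol."""
    return vocabulary * units

def recurrent_params(gate_blocks, inputs, units, layers):
    """Each gate block has an input matrix, a recurrent matrix, and a bias."""
    per_layer = gate_blocks * (inputs * units + units * units + units)
    return layers * per_layer

def dense_params(units, vocabulary):
    """Weight matrix plus one bias per output word."""
    return (units + 1) * vocabulary

print("embedding:", embedding_params(INPUT_SYMBOLS, UNITS))    # 16200
print("GRU:      ", recurrent_params(3, UNITS, UNITS, 2))      # 4323600
print("LSTM:     ", recurrent_params(4, UNITS, UNITS, 2))      # 5764800
print("dense:    ", dense_params(UNITS, OUTPUT_WORDS))         # 9847986
```

The Nested LSTM row matches the LSTM row if each of its 2 memory levels is counted like one LSTM layer with 600 inputs, which is consistent with the table but is an inference rather than something stated in the text.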

There are 508583 sequences in the training dataset and a batch size of 64 observations is used. A single pass over the entire dataset (one epoch) then corresponds to 7946 mini-batch iterations, which is enough to train the network; the models are therefore only trained for 7946 iterations. For training, Adam optimization is used with default parameters.

| Model | Cross Entropy | Accuracy |
|---|---|---|
| GRU | 2.1497 | 51.61% |
| LSTM | 2.2899 | 49.90% |
| Nested LSTM | 2.6051 | 45.47% |

**Model testing:** shows the testing loss and accuracy for the GRU, LSTM, and Nested LSTM models on the autocomplete problem.
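As a reminder of what the two columns measure, here is a minimal sketch of cross entropy and accuracy computed from predicted probabilities. It assumes the natural logarithm, an unweighted mean over all predictions, and top-1 accuracy; whether the article's evaluation masks padding or averages differently is not stated, so treat these as assumptions.

```python
import math

def cross_entropy(probabilities, targets):
    """Mean negative log-probability assigned to the correct class."""
    losses = [-math.log(p[t]) for p, t in zip(probabilities, targets)]
    return sum(losses) / len(losses)

def accuracy(probabilities, targets):
    """Fraction of predictions whose most probable class is the target."""
    hits = sum(
        1 for p, t in zip(probabilities, targets)
        if max(range(len(p)), key=p.__getitem__) == t
    )
    return hits / len(targets)

# Hypothetical predictions over 3 classes for 2 observations.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
targets = [0, 1]
print(cross_entropy(predicted, targets), accuracy(predicted, targets))
```

A lower cross entropy means the model puts more probability on the correct word, so the two columns in the table rank the models in the same order.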

As seen from the results, the models are more or less equally fast. Surprisingly, the Nested LSTM is not better than the LSTM or GRU models. This somewhat contradicts the results found in the Nested LSTM paper [6], although they tested the model on different problems and therefore the results are not exactly comparable. Nevertheless, one would still expect the Nested LSTM model to perform better for this problem, where long-term memorization is important for the contextual understanding.

An unexpected result is that the Nested LSTM model initially converges much faster than the LSTM and GRU models. This, combined with the worse performance, indicates that the Nested LSTM optimizes towards a suboptimal local minimum.

## Conclusion

The Nested LSTM model did not provide any benefits over the LSTM or GRU models. This indicates, at least for the autocomplete example, that there isn't a connection between the number of internal memory states and the model's ability to memorize and use that memory for contextual understanding.

## Acknowledgments

Many thanks to the authors of the original Nested LSTM paper [6], Joel Ruben Antony Moniz and David Krueger. Even though our findings weren't the same, they have inspired much of this article and shown that something as widely used as the recurrent unit is still an open research area.

## References

1. **Deep Speech: Scaling up end-to-end speech recognition.** Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A. and Ng, A.Y., 2014. arXiv preprint arXiv:1412.5567.
2. **Sequence to Sequence Learning with Neural Networks.** Sutskever, I., Vinyals, O. and Le, Q.V., 2014. arXiv preprint arXiv:1409.3215.
3. **Attention Is All You Need.** Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I., 2017. arXiv preprint arXiv:1706.03762.
4. **Neural Machine Translation in Linear Time.** Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A. and Kavukcuoglu, K., 2017. arXiv preprint arXiv:1610.10099.
5. **Neural Machine Translation by Jointly Learning to Align and Translate.** Bahdanau, D., Cho, K. and Bengio, Y., 2014. arXiv preprint arXiv:1409.0473.
6. **Nested LSTMs.** Moniz, J.R.A. and Krueger, D., 2018. arXiv preprint arXiv:1801.10308.
7. **Deep Residual Learning for Image Recognition.** He, K., Zhang, X., Ren, S. and Sun, J., 2015. arXiv preprint arXiv:1512.03385.
8. **Supervised Sequence Labelling with Recurrent Neural Networks.** Graves, A., 2008.
9. **Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.** Cho, K., Merrienboer, B.v., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y., 2014. arXiv preprint arXiv:1406.1078.