Nested LSTM: makes the cell update depend on another LSTM unit, which supposedly allows for more long-term memory compared to stacking LSTM layers.
Even though the LSTM unit and GRU solve the vanishing gradient problem on a theoretical level, long-term memorization continues to be a challenge in recurrent neural networks.
There are alternatives to the LSTM, the most popular being the Gated Recurrent Unit (GRU). However, the GRU doesn't necessarily give better long-term context, particularly as it solves the vanishing gradient problem without using any internal memory.
The Nested LSTM unit attempts to solve the long-term memorization problem from a more practical point of view. Where the classic LSTM unit solves the vanishing gradient problem by adding internal memory, and the GRU attempts to be a faster solution than the LSTM by using no internal memory, the Nested LSTM goes in the opposite direction of the GRU: it adds additional memory to the unit.
The idea here is that adding additional memory to the unit allows for more long-term memorization.
The additional memory is integrated by changing how the cell value $c_t$ is updated. Instead of defining the cell value update as $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, it uses another LSTM unit:

$$
c_t = \mathrm{LSTM}\big(\tilde{x}_t, \tilde{h}_{t-1}, \tilde{c}_{t-1}\big), \qquad \tilde{x}_t = i_t \odot g_t, \qquad \tilde{h}_{t-1} = f_t \odot c_{t-1}
$$

Note that the variables defined inside $\mathrm{LSTM}(\cdot)$ are different from those defined below. The end result is that a Nested LSTM unit has two memory states: the outer $c_t$ and the inner $\tilde{c}_t$.
The complete set of equations then becomes:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= g(W_g x_t + U_g h_{t-1} + b_g) \\
\tilde{x}_t &= i_t \odot g_t \\
\tilde{h}_{t-1} &= f_t \odot c_{t-1} \\
\tilde{i}_t &= \sigma(\tilde{W}_i \tilde{x}_t + \tilde{U}_i \tilde{h}_{t-1} + \tilde{b}_i) \\
\tilde{f}_t &= \sigma(\tilde{W}_f \tilde{x}_t + \tilde{U}_f \tilde{h}_{t-1} + \tilde{b}_f) \\
\tilde{o}_t &= \sigma(\tilde{W}_o \tilde{x}_t + \tilde{U}_o \tilde{h}_{t-1} + \tilde{b}_o) \\
\tilde{g}_t &= \tilde{g}(\tilde{W}_g \tilde{x}_t + \tilde{U}_g \tilde{h}_{t-1} + \tilde{b}_g) \\
\tilde{c}_t &= \tilde{f}_t \odot \tilde{c}_{t-1} + \tilde{i}_t \odot \tilde{g}_t \\
c_t &= \tilde{h}_t = \tilde{o}_t \odot \tanh(\tilde{c}_t) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
Like in vanilla LSTM, the gate activation function $\sigma(\cdot)$ is usually the sigmoid activation function. However, only $g(\cdot)$ is set to $\tanh(\cdot)$, while $\tilde{g}(\cdot)$ is just the identity function; otherwise two non-linear activation functions would be applied on the same scalar, without any change except for the multiplication by the input gate. The activation functions for $\tilde{h}_t$ and $h_t$ remain the same.
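To make these equations concrete, here is a minimal NumPy sketch of a single Nested LSTM step. It follows the equations above; the class name `NestedLSTMCell`, the weight layout, and the parameter names are only illustrative assumptions, not taken from any particular library or from the original paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class NestedLSTMCell:
    """Minimal Nested LSTM cell: the cell update is computed by an inner LSTM."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        def mat(rows, cols):
            return rng.normal(0.0, 0.1, size=(rows, cols))
        # Outer unit parameters: gates i, f, o and candidate g.
        self.W = {k: mat(hidden_size, input_size) for k in "ifog"}
        self.U = {k: mat(hidden_size, hidden_size) for k in "ifog"}
        self.b = {k: np.zeros(hidden_size) for k in "ifog"}
        # Inner unit parameters (the tilde variables in the equations).
        self.Wt = {k: mat(hidden_size, hidden_size) for k in "ifog"}
        self.Ut = {k: mat(hidden_size, hidden_size) for k in "ifog"}
        self.bt = {k: np.zeros(hidden_size) for k in "ifog"}

    def step(self, x, state):
        h_prev, c_prev, c_inner_prev = state  # the two memory states: c_t and c̃_t

        # Outer gates and candidate.
        i = sigmoid(self.W["i"] @ x + self.U["i"] @ h_prev + self.b["i"])
        f = sigmoid(self.W["f"] @ x + self.U["f"] @ h_prev + self.b["f"])
        o = sigmoid(self.W["o"] @ x + self.U["o"] @ h_prev + self.b["o"])
        g = np.tanh(self.W["g"] @ x + self.U["g"] @ h_prev + self.b["g"])

        # Inputs to the inner LSTM, replacing c = f * c_prev + i * g.
        x_in = i * g        # x̃_t
        h_in = f * c_prev   # h̃_{t-1}

        # Inner gates; the inner candidate uses the identity activation.
        i_in = sigmoid(self.Wt["i"] @ x_in + self.Ut["i"] @ h_in + self.bt["i"])
        f_in = sigmoid(self.Wt["f"] @ x_in + self.Ut["f"] @ h_in + self.bt["f"])
        o_in = sigmoid(self.Wt["o"] @ x_in + self.Ut["o"] @ h_in + self.bt["o"])
        g_in = self.Wt["g"] @ x_in + self.Ut["g"] @ h_in + self.bt["g"]

        c_inner = f_in * c_inner_prev + i_in * g_in  # c̃_t, the extra memory state
        c = o_in * np.tanh(c_inner)                  # c_t = h̃_t
        h = o * np.tanh(c)                           # outer output, as in vanilla LSTM
        return h, (h, c, c_inner)

# Example: one step from a zero initial state (the state carries both memories).
cell = NestedLSTMCell(input_size=8, hidden_size=16)
state = (np.zeros(16), np.zeros(16), np.zeros(16))  # (h, c, c̃)
output, state = cell.step(np.ones(8), state)
```

Note that the returned state carries both $c_t$ and $\tilde{c}_t$, which is exactly what distinguishes the Nested LSTM state from a vanilla LSTM state.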
The abstraction of how to combine the input with the cell value allows a lot of flexibility. Using this abstraction, it is not only possible to add one extra internal memory state; the internal unit can itself recursively be replaced by as many nested internal units as one would wish, thereby adding even more internal memory, as sketched below.
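As a rough illustration of that recursion, the sketch below treats the cell update as a replaceable memory function. The names (`make_nested_cell`, `depth`) are made up for this example: depth 1 recovers the vanilla LSTM, depth 2 corresponds to the Nested LSTM above, and larger depths keep nesting units inside units. For brevity it keeps $\tanh$ as the candidate activation at every level instead of switching the inner ones to the identity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_nested_cell(input_size, hidden_size, depth, seed=0):
    """Build a cell whose memory update is nested `depth` levels deep.

    Returns (step, zero_state). For depth >= 1, step(x, h_prev, state) -> (h, state);
    depth == 0 is the internal base case: the plain additive update of a vanilla LSTM.
    """
    rng = np.random.default_rng(seed)

    if depth == 0:
        # Base case: c_t = f ⊙ c_{t-1} + i ⊙ g_t; the arguments arrive pre-gated.
        def step(x_in, h_in, state):
            return h_in + x_in, state
        return step, ()

    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    W = {k: mat(hidden_size, input_size) for k in "ifog"}
    U = {k: mat(hidden_size, hidden_size) for k in "ifog"}
    b = {k: np.zeros(hidden_size) for k in "ifog"}

    # The cell update of this level is itself a (possibly nested) memory function.
    inner_step, inner_zero = make_nested_cell(hidden_size, hidden_size, depth - 1, seed + 1)

    def step(x, h_prev, state):
        c_prev, inner_state = state
        i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])
        f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])
        o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])
        g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])
        # Delegate the cell update to the next level down.
        c, inner_state = inner_step(i * g, f * c_prev, inner_state)
        h = o * np.tanh(c)
        return h, (c, inner_state)

    return step, (np.zeros(hidden_size), inner_zero)

# depth=1 is a vanilla LSTM, depth=2 a Nested LSTM, depth=3 adds yet another memory.
step, state = make_nested_cell(input_size=8, hidden_size=16, depth=3)
h = np.zeros(16)
for x in np.random.default_rng(1).normal(size=(5, 8)):  # a toy sequence of 5 steps
    h, state = step(x, h, state)
```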
From a theoretical point of view, it is not really clear whether the Nested LSTM unit improves long-term context. The LSTM unit theoretically solves the vanishing gradient problem, and a network of LSTM units is Turing complete. In theory, an LSTM unit should therefore be sufficient for solving problems that require long-term memorization.
That being said, it is often very difficult to train LSTM- and GRU-based recurrent neural networks. These difficulties often come down to the curvature of the loss function, and it is possible that the Nested LSTM improves this curvature and is therefore easier to optimize.