— Ch. 1 · The Vanishing Gradient Problem —
Long short-term memory.
In 1991, Sepp Hochreiter submitted a German diploma thesis that identified a critical flaw in early recurrent neural networks. The problem was practical and computational rather than theoretical: when these classic models were trained with back-propagation, the gradients carrying long-range error signals tended to vanish. As the error was propagated backwards through many time steps, it was repeatedly multiplied by small factors, and the resulting values became so tiny that the model effectively stopped learning. This phenomenon became known as the vanishing gradient problem, and it prevented standard RNNs from learning arbitrarily long-term dependencies in input sequences. Hochreiter's supervisor Jürgen Schmidhuber considered the thesis highly significant for its identification of this barrier.
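To make the effect concrete, here is a minimal sketch (not from the original text) of a scalar recurrence h_t = tanh(w * h_{t-1}) with an assumed weight of w = 0.5. The chain rule multiplies one small factor per time step, so the gradient of a late state with respect to the initial state shrinks towards zero:

import numpy as np

w = 0.5      # recurrent weight, chosen small for the demonstration
h = 0.8      # initial hidden state
grad = 1.0   # running product of per-step derivatives, i.e. d h_t / d h_0

for t in range(1, 51):
    h = np.tanh(w * h)
    # chain rule factor for this step: d h_t / d h_{t-1} = w * (1 - tanh(...)^2)
    grad *= w * (1.0 - h ** 2)
    if t % 10 == 0:
        print(f"step {t:2d}: |d h_t / d h_0| = {abs(grad):.3e}")

Running this prints gradient magnitudes that collapse by many orders of magnitude within a few dozen steps, which is the behaviour Hochreiter analysed.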
Architectural Design and Gates
An LSTM unit typically consists of a cell and three gates: an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, while the gates regulate the flow of information into and out of the cell. The forget gate decides what information to discard from the previous cell state by mapping each component to a value between zero and one: a value close to one means the information is retained, while a value close to zero means it is discarded. The input gate decides which pieces of new information to store in the current cell state, using the same zero-to-one scheme. The output gate controls which pieces of the current cell state to output, assigning a value between zero and one based on the current input and the previous hidden state. By selectively outputting only relevant information, the network can maintain useful long-term dependencies.
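The following NumPy sketch shows one possible forward step of such a cell, assuming the standard gate formulation; the weight names (W_f, U_f, and so on), the dimensions, and the random initialisation are invented for the example and are not part of the original text:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """Advance the cell one time step and return (h_t, c_t)."""
    W_f, U_f, b_f, W_i, U_i, b_i, W_o, U_o, b_o, W_c, U_c, b_c = params

    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)    # forget gate: 0 = discard, 1 = keep
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)    # input gate: how much new info to write
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)    # output gate: how much of the cell to expose
    c_hat = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # candidate values for the cell

    c_t = f_t * c_prev + i_t * c_hat                 # new cell state
    h_t = o_t * np.tanh(c_t)                         # new hidden state / output
    return h_t, c_t

# Tiny usage example with random weights (hidden size 4, input size 3).
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
params = []
for _ in range(4):  # forget, input, output, candidate
    params += [rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)]
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_x), h, c, params)
print(h)

Note that the forget gate multiplies the previous cell state element-wise rather than passing it through a squashing nonlinearity at every step, which is what lets the cell carry information across long spans of time.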