Notes on RNN’s Historical Progression
Good article; provides a crisp overview and progression of how RNNs evolved.
Section 1 has a simple example that builds the mental model for an RNN, and section 2 gives the final math for how BPTT (backprop through time) works.
Code: here in the RNN notebook.
I played around with torch, trying to implement an RNN with just autograd; it helped me grasp what BPTT really does by visualizing the auto-diff graph. It also exposes RNNs in a fundamental way: there is no avoiding the time-step loop required for sequential processing of an input sequence. The best we can do is batch (and pad, if needed) the input and then process a batch of sequences one time-step at a time, which is still slow.

While RNNs can be used in quite a few different ways (classification, seq2seq, many:many auto-regressive, bi-directional, etc.), they also have their drawbacks. I could see that training is highly sensitive to weight initialization, and the exploding and vanishing gradient problems mentioned in section 3 creep up really quickly.

While implementing with just autograd I ran into various questions about how to do so, and even once I got the conceptual implementation going, training was too slow, which led me to explore how torch implements it. Looking at the source, I realized most of the implementation is off-loaded to the C++/CUDA backend, which handles both the timestep loop and BPTT internally and has custom implementations for the ReLU and tanh activations specifically. Overall, RNNs are really annoying to train and slow even when using the torch lib.
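To make the "just autograd" point concrete, here is a minimal sketch of the kind of thing I mean: a vanilla RNN cell written as a bare Python loop over timesteps, with a single `backward()` call doing BPTT through the whole unrolled loop. The variable names (`W_xh`, `W_hh`, etc.) and toy dimensions are my own, not from the article or from torch's internals.

```python
import torch

torch.manual_seed(0)
I, H, T, B = 4, 8, 5, 3  # input dim, hidden dim, timesteps, batch size

# Small-scale init matters: training is very sensitive to it,
# and poor init makes the exploding/vanishing-grad problems show up fast.
W_xh = (0.1 * torch.randn(I, H)).requires_grad_()
W_hh = (0.1 * torch.randn(H, H)).requires_grad_()
b_h = torch.zeros(H, requires_grad=True)

x = torch.randn(T, B, I)  # a padded batch of sequences, time-major
h = torch.zeros(B, H)     # initial hidden state

# The unavoidable timestep loop: each step depends on the previous h,
# so even with batching we process one timestep at a time.
for t in range(T):
    h = torch.tanh(x[t] @ W_xh + h @ W_hh + b_h)

loss = h.sum()   # toy loss on the final hidden state
loss.backward()  # autograd walks back through every timestep == BPTT

# Gradients from all T timesteps have accumulated into the shared weights.
print(W_hh.grad.shape)
```

Even this tiny version shows why it is slow: the `for t in range(T)` loop runs in Python and the autograd graph grows with `T`, which is exactly what torch's C++/CUDA backend avoids by fusing the loop internally.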
One interesting bit from this article was the Encoder-Decoder RNN, which is probably more similar to a VAE but closely foreshadows the coming auto-regressive transformer idea :) . I will be skipping the transformer bits here for now, since the focus is to progress from RNNs (briefly looking over LSTM and GRU) to the ideas in the “Layer Normalization” paper.