Language is a river, not a lake. To understand the end of a sentence, a model must remember the beginning. RNNs provide the memory that makes sequence modeling possible.
1. The Need for Memory
Standard feed-forward networks (like MLPs or CNNs) process each input independently. They keep no state between inputs, so they have no built-in concept of time or order. If you feed them the words of a sentence one by one, nothing about the first word remains by the time they see the last.
But language is sequential. The word "bank" means something entirely different if preceded by "river" versus "rob the". Recurrent Neural Networks (RNNs) were invented to solve this by introducing a recurrent loop: the hidden state computed at one step is fed back in at the next step, which allows information to persist.
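The loop described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the sizes, the random weights, and the names `W_xh`, `W_hh`, `rnn_step`, and `run_rnn` are hypothetical and do not come from any particular library. The random vectors stand in for word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # illustrative sizes (assumption)

# Hypothetical parameters of a single recurrent layer
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden (the loop)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def run_rnn(inputs):
    """Feed a sequence through the loop and return the final hidden state."""
    h = np.zeros(hidden_size)  # empty memory before the first word
    for x_t in inputs:
        h = rnn_step(x_t, h)
    return h

sentence = rng.normal(size=(5, input_size))  # 5 stand-in word vectors
h_forward = run_rnn(sentence)
h_reversed = run_rnn(sentence[::-1])         # same words, opposite order
print(h_forward)
print(h_reversed)
```

Because each step depends on the previous hidden state, the same words in a different order produce a different final state, which is exactly the order sensitivity a network without memory lacks.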
"""
Standard NN: word3 -> [Model] -> output
RNN:         word3 + memory_of_word2 -> [Model] -> output
"""
3. Vanishing Gradients
While the theory of RNNs is beautiful, their reality is flawed. During backpropagation, the gradients (the error signals used for learning) must travel backward through time, step by step.
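If a sequence is 50 steps long, the gradient is multiplied by a similar factor (determined by the recurrent weights and the activation's derivative) roughly 50 times. If those factors are smaller than 1, the gradient shrinks exponentially toward zero: 0.9 multiplied by itself 50 times is already about 0.005. This is the Vanishing Gradient Problem. It means standard RNNs suffer from "amnesia": they struggle to learn from information at the beginning of a long sentence.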
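To make the shrinkage concrete, the sketch below tracks how strongly the hidden state at step t still depends on the initial hidden state in a small tanh RNN. It is a toy illustration, assuming small random recurrent weights (`W_hh`), random stand-in inputs, and an arbitrary hidden size; none of these values come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 50

# Hypothetical recurrent weights, deliberately small
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

h = np.zeros(hidden_size)
jacobian = np.eye(hidden_size)  # d h_t / d h_0, starts as the identity
norms = []

for _ in range(steps):
    x_t = rng.normal(size=hidden_size)  # stand-in input, already projected to hidden size
    h = np.tanh(W_hh @ h + x_t)
    # Jacobian of one step: d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_hh
    step_jacobian = (1.0 - h**2)[:, None] * W_hh
    # Chain rule through time: multiply in one more factor per step
    jacobian = step_jacobian @ jacobian
    norms.append(np.linalg.norm(jacobian))

print("after 1 step:  ", norms[0])
print("after 10 steps:", norms[9])
print("after 50 steps:", norms[-1])
```

The printed norms measure how much a change at the start of the sequence can still influence the state 1, 10, and 50 steps later. With weights this small, the influence collapses toward zero long before the end of the sequence.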
# The Vanishing Gradient Problem
# Short-term memory: Good
# Long-term memory: Lost
4. LSTM Gates to the Rescue
To solve this amnesia, researchers invented Long Short-Term Memory (LSTM) networks. Instead of a simple loop, LSTMs use a complex internal architecture called Gates.
An LSTM contains a Forget Gate (to drop irrelevant past data), an Input Gate (to add new data), and an Output Gate (to decide how much of the memory to expose at each step). By explicitly learning what to remember and what to forget, LSTMs create a "gradient superhighway" (the cell state) that lets information flow across far more time steps than a standard RNN can manage, with much less vanishing.
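The three gates can be written out as a single time step in NumPy. This is a minimal sketch of the commonly used LSTM formulation, not a reference implementation: the stacked weight matrix `W`, the gate ordering inside it, the sizes, and the helper names `sigmoid` and `lstm_step` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenated [h_prev, x_t] to four stacked pre-activations,
    here ordered as forget, input, candidate, output (an arbitrary choice).
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:n])            # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])       # input gate: how much new data to write
    g = np.tanh(z[2 * n:3 * n])   # candidate values to write
    o = sigmoid(z[3 * n:])        # output gate: how much memory to expose
    c_t = f * c_prev + i * g      # cell state: additive update
    h_t = o * np.tanh(c_t)        # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # illustrative sizes (assumption)
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # 5 stand-in word vectors
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```

The line `c_t = f * c_prev + i * g` is the key design choice: the old cell state is scaled by the forget gate and added to, rather than squashed through another weight matrix and nonlinearity, so when `f` stays close to 1 the gradient along the cell state passes through nearly unchanged.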
# LSTM Core Components
# 1. Forget Gate: Drop bad memory
# 2. Input Gate: Add new data
# 3. Output Gate: Expose memory as output
5. Implementation in PyTorch
Coding the matrix multiplication for LSTM gates from scratch is highly educational, but in production, we rely on optimized frameworks like PyTorch.
With a single line of code, you can instantiate a highly optimized, multi-layer LSTM. You simply define the input size (e.g., your embedding dimension) and the hidden size (the capacity of the memory). PyTorch handles all the complex looping and gating under the hood.
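import torch.nn as nn

# Input size, hidden size, number of layers
lstm = nn.LSTM(
    input_size=300,   # dimension of each input vector (e.g., the embedding size)
    hidden_size=128,  # size of the hidden and cell states (the memory capacity)
    num_layers=2,     # two LSTM layers stacked on top of each other
)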
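To see what the layer returns, here is a short usage sketch that pushes a random batch through it. The sequence length and batch size are arbitrary values chosen for illustration, and the random tensor stands in for real embeddings. By default `nn.LSTM` expects input shaped `(seq_len, batch, input_size)`; passing `batch_first=True` to the constructor switches to `(batch, seq_len, input_size)`.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=300, hidden_size=128, num_layers=2)

seq_len, batch_size = 10, 4                 # illustrative values (assumption)
x = torch.randn(seq_len, batch_size, 300)   # stand-in for embedded tokens

# output: top-layer hidden state at every time step
# h_n, c_n: final hidden and cell states for each layer
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([10, 4, 128])
print(h_n.shape)     # torch.Size([2, 4, 128])
print(c_n.shape)     # torch.Size([2, 4, 128])
```

For this unidirectional layer, the last time step of `output` matches the final hidden state of the top layer, `h_n[-1]`, which is the vector typically handed to a classifier head.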