RNNs for Text Classification: Decoding Sequence Memory
Human language is inherently sequential. To teach a machine to read, we must abandon rigid, fixed-size inputs and embrace architectures that possess memory. This is the domain of the Recurrent Neural Network.
Why Standard Networks Fail at Text
Traditional Feed-Forward Neural Networks (like Multilayer Perceptrons) require fixed-size inputs and assume all inputs are independent of each other. If you feed the sentence "The weather is bad, not good" into a standard network, it has no structural way to understand that "not" immediately modifies "good".
The Magic of the Hidden State
Recurrent Neural Networks (RNNs) solve this by processing data in a loop. When an RNN evaluates step $t$, it looks at both the current input word and a hidden state (memory) passed down from step $t-1$.
The hidden state computed at step $t$ is then passed forward to step $t+1$. This recursive structure lets the network maintain a running "summary" of everything it has read so far.
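The recurrence can be sketched in a few lines of NumPy. This is a minimal illustration, not a production layer: the dimensions, the random weights, and the `rnn_forward` helper are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): each word is a 4-dim vector, hidden state is 3-dim.
input_dim, hidden_dim = 4, 3

# The same three weight tensors are reused at every time step.
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    """Apply h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h) over a sequence."""
    h = np.zeros(hidden_dim)        # h_0: empty memory before the first word
    for x_t in inputs:              # one step per word
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                        # a summary of the whole sequence

sentence = rng.normal(size=(6, input_dim))   # a toy 6-"word" sequence
final_state = rnn_forward(sentence)
print(final_state.shape)  # (3,)
```

Note that the loop carries `h` forward: each step sees both the current word and the memory produced by the previous step.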
Architecture: Many-to-One
When we want to classify text (e.g., Sentiment Analysis, Spam Detection), we use a Many-to-One architecture.
- Many Inputs: The network sequentially reads every word in the document.
- One Output: We discard all intermediate predictions and only take the hidden state produced after reading the final word. We pass this final vector into a standard Dense layer to make our classification decision.
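A sketch of that final classification step, assuming `h_final` is the hidden state left over after the last word (here it is just a random vector standing in for the RNN's output):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 3

# Stand-in for the hidden state after reading the final word.
h_final = np.tanh(rng.normal(size=hidden_dim))

# One Dense layer plus a sigmoid turns that vector into P(positive).
W_out = rng.normal(scale=0.5, size=(1, hidden_dim))
b_out = np.zeros(1)

logit = W_out @ h_final + b_out
p_positive = 1.0 / (1.0 + np.exp(-logit))   # sigmoid squashes to (0, 1)
label = "positive" if p_positive[0] > 0.5 else "negative"
print(label, float(p_positive[0]))
```

All intermediate hidden states are ignored; only the last one reaches the Dense layer, which is exactly what "Many-to-One" means.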
Architectural Flaw: The Gradient Problem
Vanishing Gradients: Because RNNs update their weights via Backpropagation Through Time (BPTT), multiplying small gradients together repeatedly causes the signal to shrink exponentially. As a result, standard RNNs suffer from short-term memory: they effectively forget words from the beginning of a long paragraph. This limitation motivated LSTMs and GRUs.
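The exponential decay is easy to see numerically. The per-step factor of 0.5 below is a made-up stand-in for the gradient contribution of one time step:

```python
# Each backprop step through time multiplies the gradient by roughly the
# same per-step factor; if that factor is below 1, the signal decays
# exponentially with sequence length.
factor = 0.5        # assumed stand-in for one step's gradient contribution
grad = 1.0
for step in range(30):   # backpropagate through 30 time steps
    grad *= factor

print(grad)   # 0.5 ** 30, roughly 9.3e-10: the first word's signal is gone
```

After only 30 steps the gradient is smaller than one billionth of its original size, so the weights learn almost nothing from the earliest words.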
Frequently Asked AI Questions
What is the difference between RNNs and CNNs for NLP?
RNNs: Process data sequentially, building memory over time. Well suited to tasks where word order is essential.
CNNs: Traditionally for images, but 1D-CNNs can scan text to find "local patterns" (like n-grams or specific phrases). They are faster to train but lack true long-term sequential memory.
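The "local pattern" idea can be sketched with NumPy: a width-3 filter slides over every trigram of embedded words, and max-pooling keeps the strongest match. The sizes and random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
embed_dim, kernel_size = 4, 3      # a width-3 filter scans trigrams

tokens = rng.normal(size=(8, embed_dim))            # 8 embedded "words"
kernel = rng.normal(size=(kernel_size, embed_dim))  # one learned filter

# Slide the filter over every window of 3 consecutive words (a 1D conv).
scores = np.array([
    np.sum(tokens[i:i + kernel_size] * kernel)
    for i in range(len(tokens) - kernel_size + 1)
])

# Max-pooling answers "did this phrase pattern occur anywhere?" --
# position-independent, but with no memory of where or in what order.
print(scores.shape, scores.max())
```

This is why 1D-CNNs excel at spotting phrases but cannot track long-range dependencies the way a recurrent hidden state can.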
Why do we need an Embedding Layer before the RNN?
Machine learning models cannot read raw strings like "apple". We must tokenize words into integers (e.g., 42). However, integers don't carry semantic meaning (word 42 isn't twice as important as word 21). An Embedding Layer maps these integers into dense mathematical vectors where similar words have similar vectors, giving the RNN a meaningful input to process.
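Mechanically, an embedding layer is just a trainable lookup table: row $i$ holds the dense vector for word id $i$. A minimal sketch with a made-up vocabulary (the words, ids, and 8-dim size are all assumptions):

```python
import numpy as np

# A hypothetical vocabulary and its integer tokenization.
vocab = {"<pad>": 0, "the": 1, "movie": 2, "was": 3, "great": 4}
sentence = ["the", "movie", "was", "great"]
token_ids = [vocab[w] for w in sentence]   # [1, 2, 3, 4]

# The embedding layer: one row of dense weights per word id.
rng = np.random.default_rng(3)
embedding_matrix = rng.normal(size=(len(vocab), 8))  # 8-dim embeddings

# "Embedding" a sentence is just indexing into the table.
embedded = embedding_matrix[token_ids]   # shape (4, 8): RNN-ready input
print(embedded.shape)
```

During training, gradients flow into the looked-up rows, which is how semantically similar words drift toward similar vectors.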
What does Backpropagation Through Time (BPTT) mean?
Standard backpropagation calculates error and adjusts weights backwards through layers. Because an RNN reuses the same weights at every time step, BPTT "unrolls" the network over time to calculate how much the weights should change based on the error at the final step, propagating the error backward through previous time steps.
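Unrolling is easiest to see on a scalar toy RNN, $h_t = w \cdot h_{t-1} + x_t$ with loss $L = h_T$. Because the same $w$ appears at every step, BPTT sums each unrolled step's contribution; the recurrence, inputs, and loss here are assumptions chosen so the gradient has a checkable closed form:

```python
# Toy scalar RNN: h_t = w * h_{t-1} + x_t, loss L = h_T.
# BPTT sums over unrolled steps: dL/dw = sum_t (w ** (T - t)) * h_{t-1}.
w = 0.9
xs = [1.0, 0.5, -0.3, 0.8]        # toy inputs, assumed for illustration

# Forward pass: unroll the loop and remember every hidden state.
hs = [0.0]                         # h_0 = 0
for x in xs:
    hs.append(w * hs[-1] + x)

# Backward pass: walk the unrolled graph from the last step to the first.
T = len(xs)
grad_w = 0.0
grad_h = 1.0                       # dL/dh_T = 1
for t in range(T, 0, -1):
    grad_w += grad_h * hs[t - 1]   # step t's local contribution to dL/dw
    grad_h *= w                    # chain rule back through h_{t-1}

# Sanity check against the closed form.
expected = sum((w ** (T - t)) * hs[t - 1] for t in range(1, T + 1))
print(abs(grad_w - expected) < 1e-12)   # True
```

The `grad_h *= w` line is also where the vanishing-gradient problem lives: with $|w| < 1$, contributions from early steps shrink geometrically.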
