DQN Architecture: Stability
"Before Experience Replay and Target Networks, combining Deep Learning with Reinforcement Learning was notoriously unstable. These two components were the missing links that allowed DeepMind to solve Atari."
Breaking Correlation: Experience Replay
Standard supervised learning assumes data is independent and identically distributed (i.i.d.). In RL, an agent's experiences arrive as a highly correlated sequence. If an agent spends five seconds exploring one corner, the neural network trains exclusively on "corner" images and can catastrophically forget what it has learned about the rest of the environment.
By saving transitions to a large Replay Buffer and training on uniformly random mini-batches, we break this correlation, smoothing the training distribution over many past states.
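A minimal sketch of such a buffer in Python (the `capacity` and `batch_size` defaults here are illustrative choices, not DeepMind's exact hyperparameters):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer storing (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        # deque with maxlen evicts the oldest transition automatically
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling is what breaks the temporal
        # correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```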
Stopping the Chase: Target Networks
In Q-Learning, our loss function compares the predicted Q-value against a Target: $Y = R + \gamma \max_{a'} Q(S', a')$.
If we use the same neural network to predict the current value and to compute the target, every gradient step that adjusts $Q(S, a)$ also shifts the target $Q(S', a')$. It is like a dog chasing its own tail. The solution is a separate Target Network $Q_{\theta^-}$ that is only updated (cloned from the main network) every $C$ steps, so the target $Y = R + \gamma \max_{a'} Q_{\theta^-}(S', a')$ stays stationary during that window.
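A PyTorch-style sketch of this periodic hard update; the tiny network, the sync period `C`, and the step count are illustrative assumptions, not the Atari settings:

```python
import copy
import torch

# Hypothetical main network; any nn.Module mapping states to Q-values works.
policy_net = torch.nn.Linear(4, 2)       # e.g., 4-dim observation, 2 actions
target_net = copy.deepcopy(policy_net)   # start as an exact clone
target_net.eval()                        # the target network is never trained directly

C = 1_000  # sync period (illustrative)

for step in range(50_000):
    # ... gradient update on policy_net, using targets computed
    #     from the frozen target_net ...
    if step % C == 0:
        # Hard update: clone the main network's weights, freezing
        # the target for the next C steps.
        target_net.load_state_dict(policy_net.state_dict())
```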
❓ Frequently Asked Questions (RL)
How big should the Replay Buffer be?
Usually between $10^5$ and $10^6$ transitions. If it is too small, you do not break correlation enough. If it is too large, the agent may train on outdated experiences generated by a very old, poor policy.
What is the Temporal Difference (TD) Error?
The TD Error is the difference between your target, $R + \gamma \max_{a'} Q(S', a')$, and your current prediction, $Q(S, a)$. This error is driven toward zero by minimizing a Mean Squared Error or Huber loss over it.
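A PyTorch-style sketch of this computation, assuming `policy_net` and `target_net` map a batch of states to rows of Q-values and that the batch arrives as flat tensors (all names and shapes here are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def td_loss(policy_net, target_net, batch, gamma=0.99):
    """Huber loss over a batch of (states, actions, rewards, next_states, dones)."""
    states, actions, rewards, next_states, dones = batch

    # Current prediction: Q(S, a) for the actions actually taken.
    q_current = policy_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Target: R + gamma * max_a' Q_target(S', a'), with no gradient
    # flowing into the frozen target network.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)

    # The TD error is (target - q_current); Huber (smooth L1) loss
    # penalizes it quadratically when small and linearly when large,
    # which is more robust to outlier errors than plain MSE.
    return F.smooth_l1_loss(q_current, target)
```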