REINFORCEMENT LEARNING /// DEEP Q NETWORKS /// TARGET STABILIZATION /// EXPERIENCE REPLAY /// MARKOV DECISION PROCESS ///

DQN Stability

Unlock the secrets behind DeepMind's Atari success. Learn how to break correlation with Replay Buffers and stabilize the loss with Target Networks.


Tutor: Standard Deep Q-Learning often diverges. The neural network 'forgets' old experiences and chases a moving target.

Architecture Map


Component: Experience Replay

A circular buffer that stores the agent's experiences. Essential for breaking sequence correlation during neural network training.

System Check

Why does standard gradient descent fail on correlated RL data?


DQN Architecture: Stability

"Before Experience Replay and Target Networks, combining Deep Learning with Reinforcement Learning was notoriously unstable. These two components were the missing links that allowed DeepMind to solve Atari."

Breaking Correlation: Experience Replay

Standard supervised learning assumes data is independent and identically distributed (i.i.d.). In RL, an agent's experiences are highly sequential and correlated. If an agent spends 5 seconds exploring a corner, the neural network trains exclusively on "corner" images, suffering from Catastrophic Forgetting of the rest of the environment.

By saving transitions to a large Replay Buffer and training on uniformly random mini-batches, we break this correlation, smoothing the training distribution over many past states.
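A minimal sketch of such a buffer, assuming a hypothetical `ReplayBuffer` class built on `collections.deque` (as mentioned in the glossary below):

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular buffer of transitions; a minimal illustrative sketch."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling is what breaks the temporal correlation
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each gradient step then trains on a `sample(batch_size)` drawn from many different episodes, rather than on the last few consecutive frames.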

Stopping the Chase: Target Networks

In Q-Learning, our loss function compares the predicted Q-value against a Target: $Y = R + \gamma \max_{a'} Q(S', a')$.

If we use the same neural network to predict the current value AND calculate the target, every gradient step that adjusts $Q(S, a)$ also shifts the target $Q(S', a')$. It is like a dog chasing its own tail. The solution is to use a separate Target Network that is only updated (cloned from the main network) every $C$ steps, keeping the target stationary during that window.
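The periodic cloning can be sketched in a few lines. This is an illustrative `TargetSync` helper (the name and weight representation are assumptions, not a specific library API):

```python
import copy

class TargetSync:
    """Keeps a frozen copy of the policy weights, hard-updated every C steps."""

    def __init__(self, policy_weights, sync_every):
        self.policy = policy_weights                  # weights being trained
        self.target = copy.deepcopy(policy_weights)   # frozen copy used for TD targets
        self.sync_every = sync_every                  # the C in "every C steps"
        self.step_count = 0

    def step(self):
        self.step_count += 1
        if self.step_count % self.sync_every == 0:
            # Hard update: clone the current policy into the target
            self.target = copy.deepcopy(self.policy)
```

Between syncs, TD targets are computed from `self.target`, so a gradient step that moves the policy's $Q(S, a)$ no longer drags the target along with it.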

Frequently Asked Questions (RL)

How big should the Replay Buffer be?

Usually between $10^5$ and $10^6$ transitions. If it's too small, you don't break correlation enough. If it's too large, the agent may train on outdated experiences generated by a very old, poor policy.

What is the Temporal Difference (TD) Error?

The TD Error is the difference between your Target, $Y = R + \gamma \max_{a'} Q(S', a')$, and your Current Prediction, $Q(S, a)$. We minimize this error with Mean Squared Error or, more robustly, Huber Loss.
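In plain Python, the TD target and Huber loss for a single transition look like this (an illustrative sketch; function names are assumptions):

```python
def td_target(reward, gamma, next_q_values, done):
    # Y = R + gamma * max_a' Q(S', a'); bootstrapping is cut off at terminal states
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def huber_loss(td_error, delta=1.0):
    # Quadratic near zero, linear in the tails: large TD errors
    # produce smaller gradients than under MSE
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_err - 0.5 * delta)
```

The linear tail is why Huber Loss is popular in DQN: a single wildly wrong Q-estimate cannot blow up the gradient the way a squared error would.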

RL Glossary

Replay Buffer
A memory structure (usually a deque) storing tuples of (state, action, reward, next_state, done).
Target Network
A slow-updating clone of the policy network used to generate stable TD targets.
Mini-batch
A small, random subset of data sampled from the Replay Buffer used to perform one gradient descent step.
Gamma (γ)
The discount factor. Determines the present value of future rewards.