Why wait for the end of a race to know if you're driving well? Temporal Difference (TD) learning lets an AI agent update its value estimates after every step of experience, without waiting for the final outcome.
1. Learning from Gaps
The core of Temporal Difference (TD) learning is the TD Error. At every step, the agent makes a prediction about the value of its current state, $V(s_t)$. One step later, it observes the reward $R_{t+1}$ and the next state $s_{t+1}$, whose estimated value is $V(s_{t+1})$. The TD Target is the sum of that reward and the discounted value of the next state, $R_{t+1} + \gamma V(s_{t+1})$, where $\gamma$ is the discount factor. The TD Error is the target minus our initial prediction, $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. Its sign and size tell us in which direction, and by how much, to adjust our beliefs.
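The definitions above can be written out as a minimal Python sketch. The function names, the step size `alpha`, and the numbers in the example are assumptions chosen for illustration; they do not come from the text.

```python
def td_target(reward, next_value, gamma):
    """One-step TD target: R_{t+1} + gamma * V(s_{t+1})."""
    return reward + gamma * next_value


def td_error(value, reward, next_value, gamma):
    """TD error: the target minus the current prediction V(s_t)."""
    return td_target(reward, next_value, gamma) - value


def td0_update(value, reward, next_value, gamma, alpha):
    """Move V(s_t) a fraction alpha (assumed step size) toward the target."""
    return value + alpha * td_error(value, reward, next_value, gamma)


# Illustrative numbers only: V(s_t) = 0.5, R = 1.0, V(s_{t+1}) = 0.8.
delta = td_error(value=0.5, reward=1.0, next_value=0.8, gamma=0.9)
new_value = td0_update(value=0.5, reward=1.0, next_value=0.8,
                       gamma=0.9, alpha=0.1)
print(delta, new_value)  # approximately 1.22 and 0.622
```

A positive error means the outcome looked better than predicted, so the estimate moves up; a negative error moves it down.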
2. The Power of Bootstrapping
Bootstrapping is the process of updating an estimate based on another estimate. While Monte Carlo waits for the actual return observed at the end of an episode, TD uses its own current best guess of the future, $V(s_{t+1})$, as part of the target. This allows for Online Learning: the agent can improve its estimates while the task is still running, which is essential for environments that never end or have very long episodes.
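As a rough illustration of online, bootstrapped updates, the sketch below runs TD(0) on a small random walk. The environment (five non-terminal states, reward 1 only for exiting on the right), the step size, and the episode count are all assumptions picked for the example. In this toy setup the true values are known to be 1/6 through 5/6, which gives something to compare against.

```python
import random


def td0_random_walk(num_episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values for a 5-state random walk with online TD(0).

    States 1..5 are non-terminal; 0 and 6 are terminal. Each move goes
    left or right with equal probability. The reward is 1 on reaching
    state 6 and 0 otherwise. (Hypothetical toy environment.)
    """
    rng = random.Random(seed)
    values = [0.0] + [0.5] * 5 + [0.0]  # terminal values stay at 0
    for _ in range(num_episodes):
        state = 3  # start in the middle
        while state not in (0, 6):
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == 6 else 0.0
            # Bootstrapped target: built from the current estimate of
            # the next state, not from the final outcome of the episode.
            target = reward + gamma * values[next_state]
            # Update immediately, before the episode has finished.
            values[state] += alpha * (target - values[state])
            state = next_state
    return values[1:6]


if __name__ == "__main__":
    estimates = td0_random_walk()
    true_values = [i / 6 for i in range(1, 6)]
    for s, (est, true) in enumerate(zip(estimates, true_values), start=1):
        print(f"state {s}: estimate {est:.3f}, true {true:.3f}")
```

Note that the update happens inside the `while` loop: nothing in it requires the episode to terminate, which is why the same pattern carries over to tasks that never end.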
3. The TD(0) Advantage
Compared to Monte Carlo, TD(0) (one-step TD) typically has much Lower Variance. Its target depends on a single random reward and transition rather than on the outcome of an entire sequence of random events, and it can update after every step instead of once per episode. The price is some Bias, because the target is built from an imperfect guess. In practice this trade-off is often favorable, and bootstrapped TD targets underlie many widely used deep reinforcement learning methods, such as DQN and actor-critic algorithms.
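The bias-variance trade-off can be made concrete with a small simulation. This is a sketch under stated assumptions: the random-walk environment, the deliberately imperfect `ESTIMATED_VALUES` table, and the sample size are all hypothetical. It samples many Monte Carlo targets and many one-step TD targets from the same starting state and compares their mean and variance.

```python
import random
import statistics

# Hypothetical, deliberately imperfect value estimates for a 5-state
# random walk (states 1..5; terminals 0 and 6 have value 0).
ESTIMATED_VALUES = [0.0, 0.3, 0.4, 0.5, 0.6, 0.7, 0.0]


def step(state, rng):
    """Move left or right with equal probability; reward 1 at state 6."""
    next_state = state + rng.choice((-1, 1))
    reward = 1.0 if next_state == 6 else 0.0
    return reward, next_state


def monte_carlo_target(state, rng):
    """Undiscounted return of one full episode starting from `state`."""
    total = 0.0
    while state not in (0, 6):
        reward, state = step(state, rng)
        total += reward
    return total


def td_target(state, rng):
    """One-step target: reward plus the estimated value of the next state."""
    reward, next_state = step(state, rng)
    return reward + ESTIMATED_VALUES[next_state]


def compare_targets(start_state=2, num_samples=5000, seed=0):
    """Sample both kinds of target and summarize mean and variance."""
    rng = random.Random(seed)
    mc = [monte_carlo_target(start_state, rng) for _ in range(num_samples)]
    td = [td_target(start_state, rng) for _ in range(num_samples)]
    return {
        "mc_mean": statistics.fmean(mc),
        "mc_var": statistics.pvariance(mc),
        "td_mean": statistics.fmean(td),
        "td_var": statistics.pvariance(td),
    }


if __name__ == "__main__":
    for name, value in compare_targets().items():
        print(f"{name}: {value:.4f}")
```

From state 2 the true value in this toy setup is 1/3. The Monte Carlo targets should average close to that but are spread widely, since each one is either 0 or 1. The TD targets should cluster tightly around 0.4, a value offset from the truth by the error in the assumed estimates.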
