Temporal Difference: Learning on the Fly

Pascual Vila
AI Architect // Code Syllabus
Temporal Difference (TD) Learning is the central innovation of modern reinforcement learning. It combines the sampling of Monte Carlo methods with the bootstrapping of dynamic programming.
The Problem with Monte Carlo
Monte Carlo (MC) methods require an agent to wait until the end of an episode to learn how good a decision was. Imagine playing a three-hour game of chess: MC only learns whether a move was good after you win or lose. In continuing environments, where episodes may never end, standard MC updates never get to happen at all.
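To make the waiting concrete, here is a minimal sketch of the MC return calculation; the reward list and discount factor are illustrative assumptions, not from the article. Nothing can be computed until the complete list of rewards for the episode exists.

```python
# Minimal Monte Carlo return calculation: the target is only available
# once the episode has fully finished and all rewards are known.
def monte_carlo_return(rewards, gamma=0.99):
    """Compute the discounted return G_0 from a *complete* episode."""
    G = 0.0
    for r in reversed(rewards):  # work backwards from the final reward
        G = r + gamma * G
    return G

# Only after the game is decided can we form the target used to update
# the values of the states visited along the way.
print(monte_carlo_return([0, 0, 0, 1]))  # reward arrives only at the very end
```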
The TD Solution: Bootstrapping
TD Learning solves this by updating its estimates step-by-step. Instead of waiting for the final reward, it uses the immediate reward plus its own estimate of what will happen next. This concept—updating an estimate based on another estimate—is known as bootstrapping.
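Written out in the standard notation (where $r_{t+1}$ is the immediate reward, $\gamma$ the discount factor, and $V$ the current value estimate), the bootstrapped one-step target is:

$$\text{TD target} = r_{t+1} + \gamma\, V(s_{t+1})$$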
The TD Error
The core of the algorithm is the TD error ($\delta_t$). It represents the difference between our old prediction and our new, slightly better prediction:

$$\delta_t = \big[\, r_{t+1} + \gamma\, V(s_{t+1}) \,\big] - V(s_t)$$
We then multiply this error by our learning rate ($\alpha$) and add it to the value of the current state: $V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t$.
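As a rough sketch of how these pieces fit together, here is a self-contained TD(0) prediction loop on a toy random-walk chain. The chain, the random policy, and the constants are assumptions chosen only for illustration; the three commented lines inside the loop are the TD target, TD error, and update described above.

```python
# TD(0) value prediction on a tiny random-walk chain (illustrative setup).
import random

NUM_STATES = 5           # states 0..4; stepping past either end terminates
GAMMA = 0.99             # discount factor (gamma)
ALPHA = 0.1              # learning rate (alpha)

V = [0.0] * NUM_STATES   # value estimates, initialised to zero

for episode in range(5000):
    state = NUM_STATES // 2                  # start in the middle
    while True:
        action = random.choice([-1, +1])     # random policy: step left or right
        next_state = state + action
        done = next_state < 0 or next_state >= NUM_STATES
        reward = 1.0 if next_state >= NUM_STATES else 0.0  # reward only for exiting right

        # TD target: immediate reward plus discounted estimate of the next state.
        # A terminal transition contributes no future value.
        target = reward if done else reward + GAMMA * V[next_state]
        td_error = target - V[state]         # delta_t
        V[state] += ALPHA * td_error         # step-by-step (online) update

        if done:
            break
        state = next_state

print([round(v, 2) for v in V])  # values rise toward the rewarding right end
```

Note that the values improve after every single step, long before any episode finishes; that online, per-step update is exactly what separates TD from MC.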
Frequently Asked Questions
What is the difference between Monte Carlo and Temporal Difference Learning?
Monte Carlo (MC): Learns only from complete episodes. It has high variance but zero bias because it uses actual returns.
Temporal Difference (TD): Learns from incomplete episodes by bootstrapping. It updates step-by-step, resulting in lower variance but higher bias (because it relies on existing estimates).
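Side by side as update rules (standard notation, with $G_t$ the actual return observed from time step $t$ onward), the contrast is:

$$\text{MC:}\quad V(s_t) \leftarrow V(s_t) + \alpha\,\big[\, G_t - V(s_t) \,\big]$$

$$\text{TD(0):}\quad V(s_t) \leftarrow V(s_t) + \alpha\,\big[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\big]$$

The MC target $G_t$ is whatever actually happened (unbiased but noisy); the TD target leans on the current estimate $V(s_{t+1})$ (biased but far less noisy).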
What does Bootstrapping mean in Reinforcement Learning?
Bootstrapping refers to algorithms that update state or action values based on the estimated values of subsequent states, rather than waiting for the actual return observed at the end of an episode. Both Dynamic Programming (DP) and TD methods use bootstrapping.
What is TD(0)?
TD(0) is the simplest form of Temporal Difference learning. The "0" refers to the fact that it looks ahead exactly one step to compute its update target, in contrast to n-step methods and TD($\lambda$), which blend targets that look multiple steps ahead.
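For comparison, here is the one-step TD(0) target next to the general $n$-step target that longer lookaheads are built from (TD($\lambda$) forms a $\lambda$-weighted average of these $n$-step targets):

$$G_t^{(1)} = r_{t+1} + \gamma\, V(s_{t+1})$$

$$G_t^{(n)} = r_{t+1} + \gamma\, r_{t+2} + \dots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n}\, V(s_{t+n})$$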