Temporal Difference: Learning on the Fly

Pascual Vila
AI Architect // Code Syllabus
Temporal Difference (TD) Learning is the central innovation of modern reinforcement learning. It combines the sampling of Monte Carlo methods with the bootstrapping of dynamic programming.
The Problem with Monte Carlo
Monte Carlo (MC) methods require an agent to wait until the end of an episode to learn how good a decision was. Imagine playing a three-hour game of chess: MC only learns whether a move was good after you win or lose. In continuing environments, where episodes may never end, standard MC updates never get to happen at all.
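To make the waiting concrete, here is a minimal sketch of the MC return calculation; the reward list and discount factor are illustrative assumptions, not from the article. Nothing can be computed until the complete list of rewards for the episode exists.

```python
# Minimal Monte Carlo return calculation: the target is only available
# once the episode has fully finished and all rewards are known.
def monte_carlo_return(rewards, gamma=0.99):
    """Compute the discounted return G_0 from a *complete* episode."""
    G = 0.0
    for r in reversed(rewards):  # work backwards from the final reward
        G = r + gamma * G
    return G

# Only after the game is decided can we form the target used to update
# the values of the states visited along the way.
print(monte_carlo_return([0, 0, 0, 1]))  # reward arrives only at the very end
```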
The TD Solution: Bootstrapping
TD Learning solves this by updating its estimates step-by-step. Instead of waiting for the final reward, it uses the immediate reward plus its own estimate of what will happen next. This concept—updating an estimate based on another estimate—is known as bootstrapping.
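Written out in the standard notation (where $r_{t+1}$ is the immediate reward, $\gamma$ the discount factor, and $V$ the current value estimate), the bootstrapped one-step target is:

$$\text{TD target} = r_{t+1} + \gamma\, V(s_{t+1})$$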
The TD Error
The core of the algorithm is the TD error ($\delta_t$). It represents the difference between our old prediction and our new, slightly better prediction:

$$\delta_t = \big[\, r_{t+1} + \gamma\, V(s_{t+1}) \,\big] - V(s_t)$$
We then multiply this error by our learning rate ($\alpha$) and add it to the value of the current state: $V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t$.
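As a rough sketch of how these pieces fit together, here is a self-contained TD(0) prediction loop on a toy random-walk chain. The chain, the random policy, and the constants are assumptions chosen only for illustration; the three commented lines inside the loop are the TD target, TD error, and update described above.

```python
# TD(0) value prediction on a tiny random-walk chain (illustrative setup).
import random

NUM_STATES = 5           # states 0..4; stepping past either end terminates
GAMMA = 0.99             # discount factor (gamma)
ALPHA = 0.1              # learning rate (alpha)

V = [0.0] * NUM_STATES   # value estimates, initialised to zero

for episode in range(5000):
    state = NUM_STATES // 2                  # start in the middle
    while True:
        action = random.choice([-1, +1])     # random policy: step left or right
        next_state = state + action
        done = next_state < 0 or next_state >= NUM_STATES
        reward = 1.0 if next_state >= NUM_STATES else 0.0  # reward only for exiting right

        # TD target: immediate reward plus discounted estimate of the next state.
        # A terminal transition contributes no future value.
        target = reward if done else reward + GAMMA * V[next_state]
        td_error = target - V[state]         # delta_t
        V[state] += ALPHA * td_error         # step-by-step (online) update

        if done:
            break
        state = next_state

print([round(v, 2) for v in V])  # values rise toward the rewarding right end
```

Note that the values improve after every single step, long before any episode finishes; that online, per-step update is exactly what separates TD from MC.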
Frequently Asked Questions
What is the difference between Monte Carlo and Temporal Difference Learning?
Monte Carlo (MC): Learns only from complete episodes. It has high variance but zero bias because it uses actual returns.
Temporal Difference (TD): Learns from incomplete episodes by bootstrapping. It updates step-by-step, resulting in lower variance but higher bias (because it relies on existing estimates).
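Side by side as update rules (standard notation, with $G_t$ the actual return observed from time step $t$ onward), the contrast is:

$$\text{MC:}\quad V(s_t) \leftarrow V(s_t) + \alpha\,\big[\, G_t - V(s_t) \,\big]$$

$$\text{TD(0):}\quad V(s_t) \leftarrow V(s_t) + \alpha\,\big[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\big]$$

The MC target $G_t$ is whatever actually happened (unbiased but noisy); the TD target leans on the current estimate $V(s_{t+1})$ (biased but far less noisy).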
What does Bootstrapping mean in Reinforcement Learning?
Bootstrapping refers to algorithms that update state or action values based on the estimated values of subsequent states, rather than waiting for the actual return observed at the end of an episode. Both Dynamic Programming (DP) and TD methods use bootstrapping.
What is TD(0)?
TD(0) is the simplest form of Temporal Difference learning. The "0" refers to the fact that it looks ahead exactly one step to compute its update target, in contrast to n-step methods and TD($\lambda$), which blend targets that look multiple steps ahead.
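For comparison, here is the one-step TD(0) target next to the general $n$-step target that longer lookaheads are built from (TD($\lambda$) forms a $\lambda$-weighted average of these $n$-step targets):

$$G_t^{(1)} = r_{t+1} + \gamma\, V(s_{t+1})$$

$$G_t^{(n)} = r_{t+1} + \gamma\, r_{t+2} + \dots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n}\, V(s_{t+n})$$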