REINFORCEMENT LEARNING /// TEMPORAL DIFFERENCE /// BOOTSTRAPPING /// TD(0)

Temporal Difference Learning

The core of modern RL: learn to update value estimates on the fly, without waiting for episodes to terminate.


LOG: Imagine predicting the weather. Monte Carlo methods make you wait until the end of the day to update your prediction. Temporal Difference (TD) updates instantly based on new observations.



Temporal Difference: Learning on the Fly

Author

Pascual Vila

AI Architect // Code Syllabus

Temporal Difference (TD) Learning is the central innovation of modern Reinforcement Learning. It combines the sampling of Monte Carlo methods with the bootstrapping of Dynamic Programming.

The Problem with Monte Carlo

Monte Carlo (MC) methods require an agent to wait until the end of an episode to know how good a decision was. Imagine playing a 3-hour game of chess; MC only learns whether a move was good after you win or lose. In continuing environments, episodes may never end, making MC inapplicable.
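To make this concrete, here is a minimal sketch of an every-visit Monte Carlo value update in a tabular setting (the state names and function are illustrative, not from a particular library). Note that nothing can be learned until the complete list of rewards for a finished episode is available.

```python
def mc_update(V, episode, alpha=0.1, gamma=0.9):
    """Every-visit Monte Carlo: update V only after the episode ends.

    episode: list of (state, reward) pairs, in order, for a FINISHED episode,
             where reward is the reward received on leaving that state.
    """
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] = V[state] + alpha * (G - V[state])
    return V

V = {"A": 0.0, "B": 0.0}
mc_update(V, [("A", 0.0), ("B", 1.0)])  # must pass the WHOLE episode
```

The key limitation is visible in the signature: `mc_update` cannot be called until the episode has terminated and every reward is known.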

The TD Solution: Bootstrapping

TD Learning solves this by updating its estimates step-by-step. Instead of waiting for the final reward, it uses the immediate reward plus its own estimate of what will happen next. This concept—updating an estimate based on another estimate—is known as bootstrapping.

The TD Error

The core of the algorithm is the TD Error ($\delta_t$). It represents the difference between our old prediction and our new, slightly better prediction.

$\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)$

We then update the value of the current state by a fraction of this error, controlled by the learning rate ($\alpha$): $V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t$.
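Putting the error and the update together, a minimal tabular TD(0) step might look like this (the state names and environment are hypothetical; this is a sketch, not a library API):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(state) toward reward + gamma * V(next_state)."""
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]           # delta_t
    V[state] = V[state] + alpha * td_error    # V(s) <- V(s) + alpha * delta_t
    return td_error

# The update happens after every single transition -- no need to wait
# for the episode to end.
V = {"A": 0.0, "B": 0.5, "terminal": 0.0}
delta = td0_update(V, "A", reward=1.0, next_state="B")
# delta = (1.0 + 0.9 * 0.5) - 0.0 = 1.45, so V["A"] becomes 0.145
```

Because the update only needs one transition `(state, reward, next_state)`, it can run inside the environment loop, in contrast to the Monte Carlo update, which needs the whole episode.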

Frequently Asked Questions

What is the difference between Monte Carlo and Temporal Difference Learning?

Monte Carlo (MC): Learns only from complete episodes. It has high variance but zero bias because it uses actual returns.

Temporal Difference (TD): Learns from incomplete episodes by bootstrapping. It updates step-by-step, resulting in lower variance but higher bias (because it relies on existing estimates).

What does Bootstrapping mean in Reinforcement Learning?

Bootstrapping refers to algorithms that update state or action values based on the estimated values of subsequent states, rather than waiting for the final, actual reward. Both Dynamic Programming (DP) and TD methods use bootstrapping.

What is TD(0)?

TD(0) is the simplest form of Temporal Difference learning. The "0" refers to the fact that it looks ahead exactly one step to compute its update target, as opposed to multi-step methods such as n-step TD or TD($\lambda$), which blend information from several future steps.

RL Parameters Glossary

TD Learning
A combination of Monte Carlo sampling and Dynamic Programming bootstrapping, used to update value functions at each time step.

Bootstrapping
Updating a learning estimate based in part on other existing estimates, without waiting for a final outcome.

TD Error
The difference between the estimated value of a state and the 'better' estimate derived from the immediate reward plus the discounted value of the next state.

Alpha (α)
The learning rate. Determines how much of the TD error is used to update the old value.

Gamma (γ)
The discount factor. Determines the importance of future rewards: values near 0 make the agent short-sighted, while values near 1 make it prioritize long-term return.

Episode
A sequence of states, actions, and rewards that starts from an initial state and ends in a terminal state.
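The effect of the discount factor γ can be seen numerically. Using an illustrative reward stream of 1 at each of four future steps, a small γ ignores the future while γ = 1 weights all rewards equally:

```python
rewards = [1.0, 1.0, 1.0, 1.0]  # reward of 1 at each of 4 future steps

for gamma in (0.0, 0.5, 0.9, 1.0):
    # Discounted return: G = sum over k of gamma^k * r_k
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    print(f"gamma={gamma}: return={G:.3f}")
# gamma=0.0 -> 1.000 (only the immediate reward counts)
# gamma=1.0 -> 4.000 (all rewards count equally)
```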